Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

Ann Nowé; Denis Steckelmacher; Senne Deproost

arxiv: 2605.14897 · v1 · pith:MIGVCYVCnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 21:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: reinforcement learning · policy distillation · explainable AI · Voronoi partitioning · critic-driven · linear subpolicies · state space partitioning · deep RL

The pith

Critic-driven Voronoi partitioning distills deep RL policies into a small set of linear functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a distillation method that breaks a deep RL policy into regions of the state space, each governed by its own linear function. It uses the critic network's value estimates to decide where to add a new linear piece, adding them only in regions where current performance falls short. This produces an explainable surrogate that assigns each state to a linear subpolicy via nearest-neighbor lookup in a Voronoi diagram. The goal is to reach performance close to the original black-box policy while keeping the number of linear pieces modest. A reader would care because it offers one concrete route to making high-performing RL agents more transparent without relying solely on matching observed actions.

Core claim

The paper claims that Critic-Driven Voronoi State Partitioning can turn a black-box deep RL policy into a collection of linear subpolicies. The method iteratively places new linear functions in state-space regions where the critic reports insufficient value, then uses a Voronoi quantizer with nearest-neighbor assignment to map every state to its nearest linear piece. Gradient descent optimizes each linear function inside its cell. On standard benchmarks the resulting model approximates the original policy's behavior with a reasonable number of such pieces.
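
The iterative procedure can be sketched on a one-dimensional toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `distill`, `fit_cells`, `nearest`, the tolerance `tol`, and the toy `critic` are all hypothetical. "Insufficient value" is taken here to mean that the critic scores the surrogate's action below the teacher's action by more than `tol`. A closed-form least-squares fit stands in for the gradient descent the paper describes, purely for brevity.

```python
from typing import Callable, List, Tuple

Piece = Tuple[float, float]  # (slope, intercept) of a 1-D linear subpolicy


def nearest(s: float, centers: List[float]) -> int:
    """Index of the Voronoi cell (nearest center) that contains state s."""
    return min(range(len(centers)), key=lambda i: abs(s - centers[i]))


def fit_cells(states: List[float], centers: List[float],
              teacher: Callable[[float], float]) -> List[Piece]:
    """Fit one line per cell by least squares (stand-in for gradient descent)."""
    pieces = []
    for k in range(len(centers)):
        # Centers are drawn from `states`, so every cell holds at least one state.
        xs = [s for s in states if nearest(s, centers) == k]
        ys = [teacher(x) for x in xs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope = cov / var if var > 0 else 0.0
        pieces.append((slope, my - slope * mx))
    return pieces


def distill(states, teacher, critic, tol=1e-3):
    """Add cells until the critic's value gap is at most tol on every state."""
    centers = [states[0]]
    while True:
        pieces = fit_cells(states, centers, teacher)

        def surrogate(s):
            slope, intercept = pieces[nearest(s, centers)]
            return slope * s + intercept

        # Assumed criterion: value lost by acting with the surrogate.
        gaps = {s: critic(s, teacher(s)) - critic(s, surrogate(s)) for s in states}
        candidates = [s for s in states if s not in centers]
        if max(gaps.values()) <= tol or not candidates:
            return centers, pieces
        # Place the next center at the non-center state with the largest gap.
        centers.append(max(candidates, key=gaps.get))


teacher = abs  # toy teacher policy with a kink at s = 0


def critic(s: float, a: float) -> float:
    """Toy stand-in for the teacher's fixed critic; peaks at the teacher action."""
    return -(a - teacher(s)) ** 2


states = [i / 10 for i in range(-10, 11)]
centers, pieces = distill(states, teacher, critic)
```

A single line cannot represent the kink in `abs`, so this toy run needs more than one piece; new centers are added only while the largest value gap exceeds `tol`.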

What carries the argument

Critic-Driven Voronoi State Partitioning: a Voronoi quantizer that assigns linear functions to state-space regions via nearest-neighbor lookup, with the critic value network used to iteratively introduce new subpolicies where value is insufficient.
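
The nearest-neighbor assignment can be illustrated with a few lines of Python. This is a sketch, assuming Euclidean distance and a scalar action; the names `nearest_center` and `act`, and the example centers and weights, are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# One linear subpolicy: action = sum(w_i * s_i) + b (scalar action for brevity).
LinearPiece = Tuple[Sequence[float], float]


def nearest_center(state: Sequence[float], centers: List[Sequence[float]]) -> int:
    """Return the index of the center closest to `state` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(state, centers[i]))


def act(state: Sequence[float], centers: List[Sequence[float]],
        pieces: List[LinearPiece]) -> float:
    """Evaluate the linear piece owned by the Voronoi cell containing `state`."""
    weights, bias = pieces[nearest_center(state, centers)]
    return sum(w * s for w, s in zip(weights, state)) + bias


# Illustrative values only: two cells in a 2-D state space.
centers = [(-1.0, 0.0), (1.0, 0.0)]
pieces = [((0.5, 0.0), 1.0), ((-0.5, 0.0), -1.0)]

left_action = act((-0.2, 0.3), centers, pieces)   # state falls in cell 0
right_action = act((2.0, 0.0), centers, pieces)   # state falls in cell 1
```

A linear scan over the centers is enough when the number of pieces is modest; a spatial index would only matter for much larger sets.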

If this is right

A modest number of linear functions suffices to approach the original policy's returns.
The method incorporates critic value information rather than minimizing only behavioral distance.
The resulting cell diagram assigns every state to exactly one linear subpolicy through nearest-neighbor lookup.
Gradient descent can optimize the parameters of each linear function inside its assigned region.
The approach applies to multiple standard RL benchmark environments.
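
Fitting one linear function inside its cell by gradient descent might look like the following sketch. The loss is an assumption: mean squared error against the teacher's actions on the states assigned to the cell; the objective the paper actually uses is not specified in the text above. `fit_linear_piece`, `lr`, `steps`, and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_linear_piece(
    states: List[Sequence[float]],
    teacher_actions: List[float],
    lr: float = 0.1,
    steps: int = 2000,
) -> Tuple[List[float], float]:
    """Fit action ~ w . s + b by batch gradient descent on mean squared error."""
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for s, a in zip(states, teacher_actions):
            # Prediction error of the current linear function on this state.
            err = sum(wi * si for wi, si in zip(w, s)) + b - a
            for j in range(dim):
                grad_w[j] += 2.0 * err * s[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy cell: the (hypothetical) teacher acts as a = 2*s - 1 inside this region.
cell_states = [(0.0,), (0.5,), (1.0,), (1.5,)]
cell_actions = [2.0 * s[0] - 1.0 for s in cell_states]
weights, bias = fit_linear_piece(cell_states, cell_actions)
```

Because the toy teacher is exactly linear inside the cell, the fit recovers a slope close to 2 and an intercept close to -1; on a real policy the residual error is what would motivate splitting the cell.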

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cell diagram might allow direct inspection of which linear piece is active in any given region, aiding manual verification.
The same critic-guided placement rule could be tested with other simple surrogate classes such as shallow trees.
If the number of pieces stays small across more domains, the method could reduce memory needed to store or transmit the policy.

Load-bearing premise

The critic value network supplies a reliable, non-circular signal for where additional linear subpolicies are needed to reduce overall policy complexity.

What would settle it

On the same benchmarks, the method either requires dozens of linear functions to reach the reported performance level or the distilled policy's returns fall substantially below the original policy's returns.

Figures

Figures reproduced from arXiv: 2605.14897 by Ann Nowé, Denis Steckelmacher, Senne Deproost.

**Figure 2.** The used validation environments. From left to right: SimpleGoal, Moun…

**Figure 3.** Performance of both the VSP-critic algorithm (green) and the random…

**Figure 4.** Number of linear subpolicies found by the VSP algorithm (green) and the…

**Figure 5.** Control policies for the SimpleGoal environment using quiver plots.

**Figure 6.** Partitioning of both LunarLander (on its x and y position) and Bipedal…

**Figure 7.** A set of 3 linear functions for MountainCarContinuous using…

Original abstract

Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition to this, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest neighbor lookups to assign a linear function to each point in the state space resulting in a cell-like diagram. We validate our approach on several well known benchmarks and proof that this distillation approaches the original policy using a reasonable sized set of linear functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's new angle is using the original critic to iteratively place linear subpolicies via Voronoi cells, but the fixed critic risks misalignment once the surrogate starts changing behavior.

The core idea here is a distillation procedure that starts with the original policy's critic and keeps adding linear subpolicies in regions the critic flags as low-value, then assigns each point to the nearest linear function through a Voronoi diagram. That combination of critic-driven placement and Voronoi quantization is the actual novelty; it is not just another behavior-cloning baseline.

It does handle the performance-interpretability trade-off more explicitly than pure imitation methods by bringing in the action-value signal. The abstract claims this gets close to the teacher policy with a modest number of linear pieces on standard benchmarks, which is the kind of concrete result that matters for this subfield.

The soft spot is exactly the one in the stress-test note. The critic stays frozen from the original policy while the surrogate is updated at each step, so its value estimates are no longer aligned with the states the current set of linear functions actually visits. Nothing in the description shows a re-query or correction step, which could cause the partitioning to either add unnecessary cells or leave performance gaps. Without seeing the full experiments and any ablation on critic drift, it is hard to judge how much this matters in practice.

This is aimed at researchers already working on policy distillation or explainable RL. It is worth sending to a serious referee because the technique is well-specified enough to be tested and the claim is falsifiable, even if the current evidence is still thin.

Referee Report

2 major / 2 minor

Summary. The paper introduces Critic-Driven Voronoi State Partitioning, a model-agnostic distillation technique that partitions the state space of a black-box deep RL policy into Voronoi cells, each assigned an optimized linear subpolicy. It iteratively adds new subpolicies in regions flagged by the original policy's critic value network as having insufficient value (serving as a proxy for complexity), with nearest-neighbor assignment yielding the final surrogate. The central claim is that this produces an explainable linear model that approaches the original policy's performance using a reasonable number of components, validated on standard RL benchmarks.

Significance. If the empirical results hold and the critic-based partitioning is shown to be reliable, the method would provide a concrete advance in explainable RL by moving beyond pure behavior cloning to incorporate value information, potentially yielding more compact and interpretable surrogates than existing distillation approaches.

major comments (2)

[Abstract / iterative introduction of subpolicies] Abstract and method description of the iterative procedure: the fixed critic value network is used to identify regions needing additional linear subpolicies, yet no mechanism is described for re-evaluating or correcting the critic under the evolving surrogate's state distribution; this directly risks misalignment and must be addressed to support the claim that the approach 'approaches the original policy'.
[Validation / benchmarks] Validation section (benchmarks and results): the manuscript asserts that the distillation 'approaches the original policy using a reasonable sized set of linear functions' but provides no quantitative details on performance gaps, number of cells used, baselines, or statistical significance in the visible description, leaving the central empirical claim unsupported.

minor comments (2)

[Abstract] Abstract uses 'proof' for what appears to be empirical validation; rephrase to 'demonstrate' or 'show empirically'.
The Voronoi quantizer and linear function assignment lack explicit equations or pseudocode in the high-level description; adding these would improve clarity.
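
One way such equations might read is given below. The notation is introduced here for illustration and is not taken from the paper: $K$ cell centers $c_i$, per-cell weights $W_i$ and biases $b_i$, and Euclidean distance are all assumptions.

```latex
k(s) = \arg\min_{i \in \{1,\dots,K\}} \lVert s - c_i \rVert_2,
\qquad
\hat{\pi}(s) = W_{k(s)}\, s + b_{k(s)}
```

Here $k(s)$ is the nearest-neighbor assignment of state $s$ to a Voronoi cell and $\hat{\pi}$ is the resulting piecewise-linear surrogate policy.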

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract / iterative introduction of subpolicies] Abstract and method description of the iterative procedure: the fixed critic value network is used to identify regions needing additional linear subpolicies, yet no mechanism is described for re-evaluating or correcting the critic under the evolving surrogate's state distribution; this directly risks misalignment and must be addressed to support the claim that the approach 'approaches the original policy'.

Authors: We acknowledge that the manuscript describes the critic as fixed from the original policy and does not detail any re-evaluation step under the surrogate's state distribution. This is a valid concern regarding potential misalignment. In revision we will expand the method section with an explicit discussion of this design choice (the critic serves as a static proxy for the target policy's value landscape) together with a short analysis of the resulting distribution shift; if the analysis indicates material risk we will also add an optional re-evaluation procedure. revision: yes
Referee: [Validation / benchmarks] Validation section (benchmarks and results): the manuscript asserts that the distillation 'approaches the original policy using a reasonable sized set of linear functions' but provides no quantitative details on performance gaps, number of cells used, baselines, or statistical significance in the visible description, leaving the central empirical claim unsupported.

Authors: The full manuscript contains benchmark results and figures, yet we agree that the quantitative support for the central claim is not presented with sufficient clarity or detail. We will revise the validation section to include a summary table reporting performance gaps to the original policy, the number of Voronoi cells, comparisons against behavior-cloning baselines, and statistical significance (means and standard errors over multiple random seeds). revision: yes
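
The kind of per-seed summary described in this response could be computed as in the sketch below. The helper `mean_and_stderr` is a hypothetical name, and the return values are placeholder numbers chosen for illustration; they are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(returns: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over per-seed returns."""
    mean = statistics.fmean(returns)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, stderr


# Placeholder per-seed episode returns; NOT results from the paper.
teacher_returns = [200.0, 210.0, 190.0, 205.0, 195.0]
surrogate_returns = [180.0, 195.0, 170.0, 190.0, 185.0]

teacher_mean, teacher_se = mean_and_stderr(teacher_returns)
surrogate_mean, surrogate_se = mean_and_stderr(surrogate_returns)
gap = teacher_mean - surrogate_mean  # performance gap to the original policy
```

Reporting the gap alongside both standard errors makes it possible to judge whether a difference between the surrogate and the original policy exceeds seed-to-seed noise.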

Circularity Check

0 steps flagged

No circularity detected; derivation uses external critic signal

The paper's core construction iteratively adds linear subpolicies in regions flagged by the original policy's fixed critic value network. This critic is an input taken from the pretrained deep RL agent and is not defined in terms of, fitted to, or updated from the surrogate Voronoi-quantized model. No equations or steps reduce a claimed prediction to a fitted parameter by construction, and the provided text invokes no self-citations, uniqueness theorems, or ansatzes from prior author work. The method therefore remains self-contained against an independent external benchmark (the original policy's critic) rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that the critic network's value estimates can be used directly as a complexity signal without additional validation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: The critic value network accurately indicates regions where the current linear approximation is insufficient.
Invoked when the method iteratively introduces new subpolicies based on value.


[1] [1]

and Veness, Joel and Bellemare, Marc G

Volodymyr Mnih et al. “Human-Level Control through Deep Reinforce- ment Learning”. In:Nature518.7540 (Feb. 2015), pp. 529–533.issn: 0028- 0836, 1476-4687.doi:10.1038/nature14236

work page doi:10.1038/nature14236 2015

[2] [2]

Continuous control with deep reinforcement learning

Timothy P. Lillicrap et al.Continuous Control with Deep Reinforcement Learning. https://arxiv.org/abs/1509.02971v6. Sept. 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments

Sindhu Padakandla. “A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments”. In:ACM Computing Surveys54.6 (July 2022), pp. 1–25.issn: 0360-0300, 1557-7341.doi:10.1145/3459991

work page doi:10.1145/3459991 2022

[4] [4]

A Tour of Reinforcement Learning: The View from Con- tinuous Control

Benjamin Recht. “A Tour of Reinforcement Learning: The View from Con- tinuous Control”. In:Annual Review of Control, Robotics, and Autonomous Systems2.1 (May 2019), pp. 253–279.issn: 2573-5144, 2573-5144.doi: 10.1146/annurev-control-053018-023825

work page doi:10.1146/annurev-control-053018-023825 2019

[5] [5]

Explanation in Artificial Intelligence:

Tim Miller. “Explanation in Artificial Intelligence: Insights from the Social Sciences”. In:Artificial Intelligence267 (Feb. 2019), pp. 1–38.issn: 0004- 3702.doi:10.1016/j.artint.2018.07.007

work page doi:10.1016/j.artint.2018.07.007 2019

[6] [6]

Generating Inter- pretable Fuzzy Controllers Using Particle Swarm Optimization and Ge- neticProgramming

Daniel Hein, Steffen Udluft, and Thomas A. Runkler. “Generating Inter- pretable Fuzzy Controllers Using Particle Swarm Optimization and Ge- neticProgramming”.In:Proceedings of the Genetic and Evolutionary Com- putation Conference Companion. Kyoto Japan: ACM, July 2018, pp. 1268– 1275.isbn: 978-1-4503-5764-7.doi:10.1145/3205651.3208277

work page doi:10.1145/3205651.3208277 2018

[7] [7]

Imitation-Projected Programmatic Reinforcement Learning

Abhinav Verma et al. “Imitation-Projected Programmatic Reinforcement Learning”.In:Advances in Neural Information Processing Systems.Vol.32. Curran Associates, Inc., 2019

2019

[8] [8]

Explainable Reinforcement Learning (XRL): A Sys- tematic Literature Review and Taxonomy

Yanzhe Bekkemoen. “Explainable Reinforcement Learning (XRL): A Sys- tematic Literature Review and Taxonomy”. In:Machine Learning(Nov. 2023).issn: 0885-6125, 1573-0565.doi:10.1007/s10994-023-06479-7

work page doi:10.1007/s10994-023-06479-7 2023

[9] [9]

Nicholas Frosst and Geoffrey Hinton.Distilling a Neural Network Into a Soft Decision Tree. Nov. 2017. arXiv:1711.09784 [cs, stat]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

Policy Distillation

Andrei A. Rusu et al. “Policy Distillation”. In:International Conference on Learning Representations. Vol. abs/1511.06295. San Diego, CA, USA, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[11] [11]

Sutton and Andrew Barto.Reinforcement Learning: An Intro- duction

Richard S. Sutton and Andrew Barto.Reinforcement Learning: An Intro- duction. Nachdruck. Adaptive Computation and Machine Learning. Cam- bridge, Massachusetts: The MIT Press, 2014.isbn: 978-0-262-19398-6

2014

[12] [12]

A Markovian Decision Process

Richard Bellman. “A Markovian Decision Process”. In:Indiana University Mathematics Journal6 (1957), pp. 679–684

1957

[13] [13]

A Natural Policy Gradient

Sham M Kakade. “A Natural Policy Gradient”. In:Advances in Neural Information Processing Systems. Vol. 14. MIT Press, 2001

2001

[14] [14]

Q-Learning

Christopher J. C. H. Watkins and Peter Dayan. “Q-Learning”. In:Machine Learning8.3 (May 1992), pp. 279–292.issn: 1573-0565.doi:10.1007/ BF00992698

1992

[15] [15]

Actor-Critic Algorithms

Vijay Konda and John Tsitsiklis. “Actor-Critic Algorithms”. In:Advances in Neural Information Processing Systems. Vol. 12. MIT Press, 1999. 18 Deproost et al

1999

[16] [16]

Soft Actor-Critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja et al. “Soft Actor-Critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”. In:Proceedings of the 35th International Conference on Machine Learning. Ed. by Jen- nifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, July 2018, pp. 1861–1870

2018

[17] [17]

John Schulman et al.Proximal Policy Optimization Algorithms. Aug. 2017. doi:10.48550/arXiv.1707.06347. arXiv:1707.06347 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1707.06347 2017

[18] [18]

Proceedings of the AAAI Conference on Artifi- cial Intelligence38(3), 2148–2156 (Mar 2024).https://doi.org/10.1609/aaai

UjjwalDasGupta,ErikTalvitie,andMichaelBowling.“PolicyTree: Adap- tive Representation for Policy Gradient”. In:Proceedings of the AAAI Con- ference on Artificial Intelligence29.1 (Feb. 2015).doi:10.1609/aaai. v29i1.9613

work page doi:10.1609/aaai 2015

[19] [19]

Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents

Christian Rupprecht, Cyril Ibrahim, and Christopher J Pal. “Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents”. In:In- ternational Conference on Learning Representations. 2020

2020

[20] [20]

Explaining Deep Adaptive Programs via Reward De- composition

Martin Erwig et al. “Explaining Deep Adaptive Programs via Reward De- composition”. In:IJCAI/ECAI Workshop on Explainable Artificial Intel- ligence. 2018

2018

[21] [21]

Toward a Psychology of Deep Reinforce- ment Learning Agents Using a Cognitive Architecture

Konstantinos Mitsopoulos et al. “Toward a Psychology of Deep Reinforce- ment Learning Agents Using a Cognitive Architecture”. In:Topics in cog- nitive science(2021)

2021

[22] [22]

Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning

Jianming Chen et al. “Understanding Individual Agent Importance in Multi-Agent System via Counterfactual Reasoning”. In:Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty- Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelli- gence....

work page doi:10.1609/aaai.v39i15.33733 2025

[23] [23]

Vector Quantization

R. Gray. “Vector Quantization”. In:IEEE ASSP Magazine1.2 (Apr. 1984), pp. 4–29.issn: 1558-1284.doi:10.1109/MASSP.1984.1162229

work page doi:10.1109/massp.1984.1162229 1984

[24] [24]

Communications of the ACM , issue_date =

Jon Louis Bentley. “Multidimensional Binary Search Trees Used for As- sociative Searching”. In:Communications of the ACM18.9 (Sept. 1975), pp. 509–517.issn: 0001-0782, 1557-7317.doi:10.1145/361002.361007

work page doi:10.1145/361002.361007 1975

[25] [25]

Some Methods for Classification and Analysis of Multi- Variate Observations

J. B. MacQueen. “Some Methods for Classification and Analysis of Multi- Variate Observations”. In:Proc. of the Fifth Berkeley Symposium on Math- ematical Statistics and Probability. Ed. by L. M. Le Cam and J. Neyman. Vol. 1. University of California Press, 1967, pp. 281–297

[26] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “"Why Should I Trust You?": Explaining the Predictions of Any Classifier”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA and New York, NY, USA: Association for Computing Machinery, 2016, pp. 1135–1144.... doi:10.1145/2939672.2939778

[27] Alejandro Barredo Arrieta et al. “Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI”. In: Information Fusion 58 (June 2020), pp. 82–115. issn: 15662535. doi:10.1016/j.inffus.2019.12.012

[28] Vilde B. Gjærum et al. “Explaining a Deep Reinforcement Learning Docking Agent Using Linear Model Trees with User Adapted Visualization”. In: Journal of Marine Science and Engineering 9.11 (Oct. 2021), p. 1178. issn: 2077-1312. doi:10.3390/jmse9111178. arXiv:2203.00368 [cs]

[29] Guiliang Liu et al. “Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees”. In: Machine Learning and Knowledge Discovery in Databases. Ed. by Michele Berlingerio et al. Vol. 11052. Cham: Springer International Publishing, 2019, pp. 414–429. isbn: 978-3-030-10927-1 978-3-030-10928-8. doi:10.1007/978-3-030-10928-8_25

[30] Matthew Green and John W. Sheppard. “On the Performance and Robustness of Linear Model U-Trees in Mimic Learning”. In: 2024 International Conference on Machine Learning and Applications (ICMLA). Miami, FL, USA: IEEE, Dec. 2024, pp. 152–159. isbn: 979-8-3503-7488-9. doi:10.1109/ICMLA61862.2024.00027

[31] Hector Kohler et al. “Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning”. In: 17th European Workshop on Reinforcement Learning. Toulouse, Oct. 2024

[32] Youri Coppens et al. “Distilling Deep Reinforcement Learning Policies in Soft Decision Trees”. In: Proceedings of the IJCAI 2019 Workshop on Explainable Artificial Intelligence. 2019, pp. 1–6

[33] Ivan S.K. Lee and Henry Y.K. Lau. “Adaptive State Space Partitioning for Reinforcement Learning”. In: Engineering Applications of Artificial Intelligence 17.6 (Sept. 2004), pp. 577–588. issn: 09521976. doi:10.1016/j.engappai.2004.08.005

[34] Riad Akrour et al. “Regularizing Reinforcement Learning with State Abstraction”. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Oct. 2018, pp. 534–539. doi:10.1109/IROS.2018.8594201

[35] Mark Towers et al. Gymnasium: A Standard Interface for Reinforcement Learning Environments. Nov. 2024. doi:10.48550/arXiv.2407.17032. arXiv:2407.17032 [cs]
