Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

Ann Now\'e; Mehrdad Asadi; Senne Deproost

arxiv: 2605.04254 · v3 · pith:F6OZ7UILnew · submitted 2026-05-05 · 💻 cs.LG · cs.HC

Hierarchical Support Vector State Partitioning for Distilling Black Box Reinforcement Learning Policies

Senne Deproost , Mehrdad Asadi , Ann Now\'e This is my paper

Pith reviewed 2026-06-30 23:50 UTC · model grok-4.3

classification 💻 cs.LG cs.HC

keywords reinforcement learningpolicy distillationsupport vector machinesstate partitioningblack-box policiesinterpretabilitysubpolicies

0 comments

The pith

SVSP partitions state-action data with linear SVMs to distill black-box RL policies into fewer interpretable subpolicies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces State Vector Space Partitioning to mimic black-box reinforcement learning policies by splitting a dataset of state-action pairs into regions using support vector machine boundaries. Each region gets its own subpolicy, producing a structured and human-readable approximation of the original agent. This yields higher average returns than both the original TD3 policy and an earlier Voronoi-based partitioning method, while using substantially fewer subpolicies. A reader would care because the approach turns opaque high-performing policies into collections of simpler pieces whose decisions can be inspected and adjusted.

Core claim

SVSP constructs a compact representation of a black-box policy by partitioning a distillation dataset of state-action pairs with linear support vector machine splits. The resulting subpolicies achieve a mean return 7.4 percent higher than Voronoi State Partitioning and 2.8 percent higher than the original TD3 policy while requiring 82.1 percent fewer subpolicies than VSP. The method thereby enables a more flexible distillation in which decision boundaries and surrogate models can be selected within a margin of the original black-box behavior.

What carries the argument

Linear support vector machine splits that divide the state vector space according to state-action pairs sampled from the black-box policy.

If this is right

The distilled policy can exceed the mean return of the original black-box policy.
The number of subpolicies can be reduced by more than 80 percent relative to prior partitioning techniques while preserving or improving performance.
Decision boundaries can be chosen flexibly within a margin of the black-box behavior.
Surrogate models inside each partition can also be selected within that same margin.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning idea could be tested on policies trained in continuous control tasks beyond the environments used here to check whether the reduction in subpolicy count scales.
If the SVM boundaries remain stable under small changes to the distillation dataset, the method might support incremental updates when new experience is collected.
Inspecting the linear boundaries could reveal which state features most strongly influence the original policy's choices in different regions.

Load-bearing premise

The sampled state-action pairs are representative enough that linear SVM boundaries will produce subpolicies whose combined behavior matches the original policy over the entire state space.

What would settle it

A large performance gap between the combined subpolicies and the original policy on states outside the distillation dataset would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.04254 by Ann Now\'e, Mehrdad Asadi, Senne Deproost.

**Figure 1.** Figure 1: Comparison of the decision boundaries between VSP and SVSP for Lunar Lander. Both the partitioning on the X and Y positions is given with 0 for all other state variables. We notice that SVSP has decision boundaries that are significantly reduced in complexity. stability while accounting for gravity and avoiding crashes. An original TD3 agent from StableBaselines3 [6] is trained, and the state-actions of 1… view at source ↗

read the original abstract

We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with linear support vector machine splits, SVSP constructs a compact and structured representation of the original policy. Our method improves mean return by +7.4% over previous critic driven state partitioning attempts such as Voronoi State Partitioning (VSP) and +2.8% over the original TD3 policy, while reducing the number of required subpolicies against VSP by 82.1%. Our results pave the path towards a more flexible form of distillation where both the decision boundary and surrogate models can be chosen within a margin of the original black box behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SVSP swaps Voronoi for SVM splits in RL policy distillation and claims fewer subpolicies plus small gains over the teacher, but the abstract supplies almost no experimental detail to back it.

read the letter

The new piece here is using linear SVMs to draw the decision boundaries when partitioning a distillation dataset of state-action pairs from a black-box policy like TD3. That replaces the Voronoi-style splits from earlier VSP work and is presented as producing a more compact hierarchy of subpolicies.

The reported numbers are a 7.4% lift in mean return over VSP, a 2.8% lift over the original TD3, and an 82% reduction in the number of subpolicies. If those hold under proper controls, the reduction in subpolicies would be the most practically useful part for anyone trying to make distilled policies easier to inspect or deploy.

The soft spot is the complete absence of experimental specifics in the abstract: no list of environments, no mention of random seeds, error bars, statistical tests, or even basic training details. A claimed improvement over the teacher policy itself is unusual enough that it needs those controls to be believable rather than an artifact of evaluation setup. The underlying assumption—that a finite rollout dataset will be dense enough for linear separators to generalize across the full continuous state space—also remains untested in the summary, and that is exactly where these methods tend to break in high-dimensional control tasks.

This is a modest, incremental tweak aimed at people already working on critic-driven or geometry-based distillation for interpretability. It does not look like a load-bearing flaw in the central idea, but the current write-up does not give enough evidence to judge whether the gains are real. I would send it to review so referees can check the full experiments and ablations; without that step it is not ready to cite or build on.

Referee Report

2 major / 2 minor

Summary. The paper introduces State Vector Space Partitioning (SVSP), a method to distill black-box RL policies (e.g., TD3) into a hierarchy of interpretable subpolicies. It partitions a distillation dataset of state-action pairs using linear SVM splits to define regions, each assigned a surrogate subpolicy. The central empirical claim is that SVSP yields +7.4% higher mean return than Voronoi State Partitioning (VSP), +2.8% higher than the original TD3 teacher, and requires 82.1% fewer subpolicies than VSP.

Significance. If the reported gains prove robust across environments and seeds, SVSP would offer a concrete advance in policy distillation by replacing critic-driven or Voronoi partitioning with margin-based linear separators that are both compact and human-interpretable. The reduction in subpolicy count is a practically useful efficiency result. The work also highlights the possibility of choosing both the decision boundary and the local models within a controlled margin of the teacher.

major comments (2)

[Abstract] Abstract: the quantitative claims (+7.4 % over VSP, +2.8 % over TD3, 82.1 % reduction in subpolicies) are stated without any reference to the number of environments, number of independent trials, statistical tests, error bars, or sensitivity to random seeds and hyper-parameters. These omissions make it impossible to judge whether the reported improvements are load-bearing for the central claim or could be artifacts of a single run.
[Experiments] Experimental evaluation: the method's correctness rests on the assumption that a finite set of rollout trajectories is sufficiently dense for linear SVM boundaries to generalize across the full continuous state space. No coverage analysis, density plots, or out-of-distribution evaluation is supplied to test this assumption; without it the +2.8 % gain over the teacher itself cannot be confidently attributed to the partitioning rather than to incomplete sampling.

minor comments (2)

[Method] Clarify in §3 how the subpolicies inside each SVM region are trained (e.g., on the same distillation data or on additional rollouts) and whether they are allowed to differ from the teacher only inside their assigned region.
[Results] Add a table or figure that directly compares the number of support vectors / subpolicies and the achieved return for SVSP versus VSP across all reported environments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical results. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the quantitative claims (+7.4 % over VSP, +2.8 % over TD3, 82.1 % reduction in subpolicies) are stated without any reference to the number of environments, number of independent trials, statistical tests, error bars, or sensitivity to random seeds and hyper-parameters. These omissions make it impossible to judge whether the reported improvements are load-bearing for the central claim or could be artifacts of a single run.

Authors: We agree that the abstract would be strengthened by additional context. In the revised version we will expand the abstract to state that results are reported over five MuJoCo environments, five independent random seeds per environment, with mean and standard deviation, and that full statistical comparisons appear in Section 4. revision: yes
Referee: [Experiments] Experimental evaluation: the method's correctness rests on the assumption that a finite set of rollout trajectories is sufficiently dense for linear SVM boundaries to generalize across the full continuous state space. No coverage analysis, density plots, or out-of-distribution evaluation is supplied to test this assumption; without it the +2.8 % gain over the teacher itself cannot be confidently attributed to the partitioning rather than to incomplete sampling.

Authors: We acknowledge the value of explicit coverage analysis. The current experiments follow standard practice in policy distillation by collecting trajectories from the converged teacher on benchmark tasks; however, we will add a dedicated paragraph in Section 4 discussing trajectory density and include state-space coverage visualizations for the evaluated environments. The reported +2.8 % improvement is obtained under identical data-collection conditions for all compared methods, supporting attribution to the partitioning rather than sampling artifacts. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains measured independently of method definition

full rationale

The paper defines SVSP via linear SVM partitioning of a distillation dataset of state-action pairs drawn from the black-box TD3 policy, then reports measured mean-return improvements on evaluation rollouts (+7.4% vs VSP, +2.8% vs teacher, 82.1% fewer subpolicies). These quantities are not obtained by algebraic rearrangement or by fitting the same parameters that define the partitions; they are external performance metrics. No equations, self-citations, or uniqueness theorems are invoked that would reduce the central claim to a tautology or to a fitted input renamed as prediction. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard RL assumptions. The method implicitly assumes that linear separators suffice for the policy's decision boundaries.

pith-pipeline@v0.9.1-grok · 5664 in / 1244 out tokens · 19275 ms · 2026-06-30T23:50:17.894806+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 2 canonical work pages

[1]

& Guestrin, C

Ribeiro, M., Singh, S. & Guestrin, C. ” Why should i trust you?” Explaining the predictions of any classifier.Proceedings Of The 22nd ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 1135-1144 (2016)

2016
[2]

& Now ´e, A

Deproost, S., Steckelmacher, D. & Now ´e, A. Explainable RL Policies by Distilling to Locally- Specialized Linear Policies with V oronoi State Partitioning.ArXiv Preprint ArXiv:2511.13322. (2025)

work page arXiv 2025
[3]

& Preux, P

Kohler, H., Delfosse, Q., Akrour, R., Kersting, K. & Preux, P. Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning. (2024,10,28)

2024
[4]

& Magazzeni, D

Coppens, Y ., Efthymiadis, K., Lenaerts, T., Now ´e, A., Miller, T., Weber, R. & Magazzeni, D. Distilling deep reinforcement learning policies in soft decision trees.Proceedings Of The IJCAI 2019 Workshop On Explainable Artificial Intelligence. pp. 1-6 (2019)

2019
[5]

& Puerto, J

Blanco, V ., Jap ´on, A. & Puerto, J. Multiclass optimal classification trees with svm-splits.Machine Learning.112, 4905-4928 (2023)

2023
[6]

& Zhokhov, P

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y . & Zhokhov, P. OpenAI Baselines.GitHub Repository. (2017), https://github.com/openai/baselines

2017
[7]

& Solar-Lezama, A

Bastani, O., Pu, Y . & Solar-Lezama, A. Verifiable reinforcement learning via policy extraction.Advances In Neural Information Processing Systems.31(2018)

2018
[8]

& Now ´e, A

Deproost, S., Steckelmacher, D. & Now ´e, A. Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution.ArXiv Preprint ArXiv:2410.21940. (2024)

work page arXiv 2024

[1] [1]

& Guestrin, C

Ribeiro, M., Singh, S. & Guestrin, C. ” Why should i trust you?” Explaining the predictions of any classifier.Proceedings Of The 22nd ACM SIGKDD International Conference On Knowledge Discovery And Data Mining. pp. 1135-1144 (2016)

2016

[2] [2]

& Now ´e, A

Deproost, S., Steckelmacher, D. & Now ´e, A. Explainable RL Policies by Distilling to Locally- Specialized Linear Policies with V oronoi State Partitioning.ArXiv Preprint ArXiv:2511.13322. (2025)

work page arXiv 2025

[3] [3]

& Preux, P

Kohler, H., Delfosse, Q., Akrour, R., Kersting, K. & Preux, P. Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning. (2024,10,28)

2024

[4] [4]

& Magazzeni, D

Coppens, Y ., Efthymiadis, K., Lenaerts, T., Now ´e, A., Miller, T., Weber, R. & Magazzeni, D. Distilling deep reinforcement learning policies in soft decision trees.Proceedings Of The IJCAI 2019 Workshop On Explainable Artificial Intelligence. pp. 1-6 (2019)

2019

[5] [5]

& Puerto, J

Blanco, V ., Jap ´on, A. & Puerto, J. Multiclass optimal classification trees with svm-splits.Machine Learning.112, 4905-4928 (2023)

2023

[6] [6]

& Zhokhov, P

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y . & Zhokhov, P. OpenAI Baselines.GitHub Repository. (2017), https://github.com/openai/baselines

2017

[7] [7]

& Solar-Lezama, A

Bastani, O., Pu, Y . & Solar-Lezama, A. Verifiable reinforcement learning via policy extraction.Advances In Neural Information Processing Systems.31(2018)

2018

[8] [8]

& Now ´e, A

Deproost, S., Steckelmacher, D. & Now ´e, A. Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution.ArXiv Preprint ArXiv:2410.21940. (2024)

work page arXiv 2024