Selecting Decision-Relevant Concepts in Reinforcement Learning

Fei Fang; Naveen Raman; Stephanie Milani

arxiv: 2604.04808 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3

keywords: reinforcement learning · concept-based policies · state abstraction · decision-relevant concepts · interpretable RL · DRS algorithm · sequential decisions · healthcare RL

The pith

A state abstraction approach lets reinforcement learning agents automatically select decision-relevant concepts with performance guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Manual selection of concepts for interpretable reinforcement learning policies requires domain expertise and offers no performance assurances. The paper reframes this as a state abstraction problem, where a concept is decision-relevant if it prevents confusing states that demand different actions. This view enables the Decision-Relevant Selection algorithm to choose a subset of concepts while bounding the resulting policy's performance relative to the full state space. Empirically, it matches or exceeds hand-picked concepts on standard benchmarks and in healthcare settings, and supports better test-time interventions. Readers would care because it automates a key step toward reliable, human-understandable sequential decision agents.

Core claim

Concept selection can be viewed through state abstraction: a concept is decision-relevant if removing it would cause the agent to confuse states requiring different actions. Agents relying on such concepts ensure that states with the same concept representation share the same optimal action, thereby preserving the optimal decision structure. This leads to the DRS algorithm that selects subsets from candidates along with performance bounds, and empirically recovers manually curated sets while matching or exceeding their performance.
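
To make the core claim concrete, the sketch below checks whether a given concept subset keeps apart states that need different actions. It is an illustration of the definition only, not the DRS algorithm; the function name `conflicting_blocks`, the toy states, and the action labels are all hypothetical, and the optimal actions are assumed to be known.

```python
from collections import defaultdict

def conflicting_blocks(concepts, optimal_action, subset):
    """Group states by their concept values on `subset` and return the
    groups whose member states do not all share one optimal action."""
    blocks = defaultdict(list)
    for state, vector in concepts.items():
        key = tuple(vector[i] for i in subset)
        blocks[key].append(state)
    return {
        key: states
        for key, states in blocks.items()
        if len({optimal_action[s] for s in states}) > 1
    }

# Hypothetical toy problem: four states, three binary concepts.
concepts = {
    "s0": (0, 0, 1),
    "s1": (0, 1, 1),
    "s2": (1, 0, 0),
    "s3": (1, 1, 0),
}
# Assumed-known optimal actions (an assumption of this sketch).
optimal_action = {"s0": "left", "s1": "right", "s2": "left", "s3": "right"}

print(conflicting_blocks(concepts, optimal_action, subset=(1,)))    # {}
print(conflicting_blocks(concepts, optimal_action, subset=(0, 2)))  # two conflicting blocks
```

In this toy case concept 1 alone is decision-relevant: dropping it, as in the subset `(0, 2)`, merges states whose optimal actions differ.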

What carries the argument

Decision-Relevant Selection (DRS) algorithm, which selects concepts by ensuring same-concept states have identical optimal actions via state abstraction.

If this is right

DRS provides performance bounds relating selected concepts to the original policy's performance.
Selected concepts enable effective test-time interventions in RL environments.
The method recovers expert-curated concept sets automatically across benchmarks.
It improves outcomes in real-world applications such as healthcare decision support.
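
The test-time intervention mentioned above can be pictured with a short sketch. It assumes the convention common in concept-based models, where an intervention overwrites a predicted concept with its true value; treating the intervention level α as a per-concept overwrite probability is an assumption of this example, and the paper's exact protocol may differ. The name `intervene` and the data are hypothetical.

```python
import random

def intervene(predicted, true, alpha, rng):
    """Overwrite each predicted concept with its true value with
    probability `alpha`; leave it unchanged otherwise."""
    return [t if rng.random() < alpha else p for p, t in zip(predicted, true)]

predicted = [1, 0, 0, 1]  # hypothetical concept-predictor output
true = [1, 1, 0, 0]       # hypothetical ground-truth concepts

rng = random.Random(0)
print(intervene(predicted, true, alpha=0.0, rng=rng))  # [1, 0, 0, 1]
print(intervene(predicted, true, alpha=1.0, rng=rng))  # [1, 1, 0, 0]
```

The corrected concept vector is then fed to the unchanged policy, which is why the choice of concepts affects how much an intervention can help.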

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The state abstraction perspective could apply to selecting interpretable features in non-sequential machine learning tasks.
Future work might combine DRS with other abstraction methods like bisimulations for more robust concept choices.
Scalability to high-dimensional concept candidate sets remains an open question for broader adoption.

Load-bearing premise

The premise that states sharing a concept representation will have the same optimal action, preserving the original decision structure.

What would settle it

An environment where applying DRS yields a policy that confuses states with different optimal actions under the selected concepts, leading to suboptimal performance not bounded as claimed.

Figures

Figures reproduced from arXiv: 2604.04808 by Fei Fang, Naveen Raman, Stephanie Milani.

**Figure 1.** Standard pipeline for training concept-based policies. Practitioners select concepts for decision-making through a labor-intensive process of iteratively selecting candidate concepts, training concept-based policies, and evaluating their performance.

**Figure 2.** Concept-based models rely on a set of decision-relevant concepts that help distinguish between different states, yet currently these concepts are manually selected. In this work, we study how to identify and select decision-relevant concepts. Our key insight is that decision-relevant concepts best separate “different” states, where difference is defined by their decision consequences. We use this insight t…

**Figure 3.** Normalized reward of concept selection algorithms with perfect (top) and imperfect (bottom) concept predictors. Our algorithm, DRS, improves performance compared to the random, variance, and greedy baselines for four out of five environments in the perfect setting. DRS and DRS-log improve performance or are optimal in all environments in the imperfect setting.

**Figure 4.** We vary the number of timesteps that we train policies for in MiniGrid, while also varying the accuracy of concept predictors (left) or the number of concepts selected (right). Increasing the accuracy of concept predictors speeds up training, while increasing the number of concepts increases the maximum performance.

**Figure 5.** Impact of the number of concepts selected against the accuracy of the underlying concepts. Increasing either the number or the accuracy of concepts has a similar impact on performance, and sufficiently many, sufficiently accurate concepts are needed to ensure good performance. (Panels include CartPole, MiniGrid, and Pong; x-axis: accuracy; y-axis: reward.)

**Figure 6.** We assess the impact of concept selection upon test-time intervention across four environments. Across all four environments, the DRS algorithm has the highest reward both before and after intervention, showing that well-selected concepts improve intervention.

**Figure 7.** The DRS algorithm can mimic manual concept selection in CUB. It matches performance (left) while selecting a subset of the manually selected concepts (right).

**Figure 8.** We assess the performance of a) not relaxing constraint P1c for DRS and b) varying ρ for CartPole and MiniGrid. We find that ρ = 0.75 performs well across perfect and imperfect settings, while introducing P1c leads to lower performance in CartPole.

**Figure 9.** We compare concept selection algorithms while varying the number of concepts selected k with perfect (top) and imperfect (bottom) concept predictors. We find that for CartPole and MiniGrid, DRS outperforms baselines across values of k, while for environments such as Boxing and Glucose, a threshold number of concepts is needed before DRS outperforms baselines.

**Figure 10.** We compare the time to compute across methods and datasets, and find that all algorithms run in under ten minutes. While DRS and DRS-log take longer than variance and greedy, they still run quickly across environments.

**Figure 12.** We assess test-time intervention performance when varying the level of intervention α in CartPole. Across all α, the DRS algorithm performs best, with steadily improving performance as α is increased.

**Figure 13.** We vary the number of timesteps we use to train the policy π without concepts in MiniGrid. Typically, we have π = π*, but here, we demonstrate that the DRS and DRS-log algorithms perform well even when π is trained for only 100k timesteps.

Original abstract

Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions. This selection demands domain expertise, is time-consuming and costly, scales poorly with the number of candidates, and provides no performance guarantees. To overcome this limitation, we propose the first algorithms for principled automatic concept selection in sequential decision-making. Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions. As a result, agents should rely on decision-relevant concepts; states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space. This perspective leads to the Decision-Relevant Selection (DRS) algorithm, which selects a subset of concepts from a candidate set, along with performance bounds relating the selected concepts to the performance of the resulting policy. Empirically, DRS automatically recovers manually curated concept sets while matching or exceeding their performance, and improves the effectiveness of test-time concept interventions across reinforcement learning benchmarks and real-world healthcare environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DRS frames concept selection in RL as state abstraction to preserve optimal action distinctions and supplies performance bounds, but the approach may still need optimal-action knowledge to operate.

The paper's core move is to treat decision-relevant concepts as those that induce a state abstraction where identical concept vectors imply identical optimal actions. This lets them define the Decision-Relevant Selection algorithm and attach bounds that relate the chosen subset to the performance of the policy built on top of it. Empirically they show the method recovers manually chosen concept sets on standard RL benchmarks and a healthcare environment while matching or beating their performance and improving test-time interventions.

Referee Report

2 major / 1 minor

Summary. The paper claims that concept selection in RL can be framed as state abstraction: a set of concepts is decision-relevant if states sharing the same concept representation have identical optimal actions, so that the decision structure is preserved. This leads to the DRS algorithm for automatic subset selection from candidates, accompanied by performance bounds on the resulting policy. Empirical results show automatic recovery of manual concept sets and competitive or superior performance on RL benchmarks and real-world healthcare environments.

Significance. If the central claims hold without circularity and with valid bounds, the work would enable principled, automatic selection of interpretable concepts for RL policies, reducing dependence on manual domain expertise while providing performance guarantees; this could meaningfully advance interpretable sequential decision-making in high-stakes domains.

major comments (2)

[Abstract] Abstract: the key insight equates decision-relevance with an abstraction in which identical concept vectors imply identical optimal actions. Operationalizing selection via partitioning states by candidate subsets and verifying action constancy within blocks requires access to (or estimation of) optimal actions/Q* for each state under varying projections. This presupposes a solution to the original MDP and introduces circularity for the motivating use case of learning policies when the optimal policy is unknown or expensive to compute.
[Abstract] Abstract (performance bounds): the bounds are asserted to relate selected concepts to policy performance, but the excerpt provides no derivation, assumptions, or error analysis. Without these details it is impossible to evaluate whether the bounds are non-vacuous or whether they apply only conditionally on already knowing the decision structure the method claims to recover.
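
The first major comment can be made concrete with a brute-force sketch: partition states by each candidate subset and test whether the action is constant within every block. This is an illustration of the referee's description, not the paper's DRS procedure; the function names and data are hypothetical. Note that the `actions` argument has to be supplied from outside, which is exactly the dependence on an optimal (or approximate) policy that the comment raises.

```python
from itertools import combinations

def preserves_actions(concept_matrix, actions, subset):
    """True if states that agree on every concept in `subset` also agree
    on the supplied action label."""
    seen = {}
    for row, action in zip(concept_matrix, actions):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, action) != action:
            return False
    return True

def smallest_preserving_subset(concept_matrix, actions):
    """Exhaustively search subsets by increasing size (exponential cost)."""
    n_concepts = len(concept_matrix[0])
    for size in range(n_concepts + 1):
        for subset in combinations(range(n_concepts), size):
            if preserves_actions(concept_matrix, actions, subset):
                return subset
    return None  # even the full concept set confuses some states

# Hypothetical data: rows are states, columns are binary concepts.
concept_matrix = [
    (0, 0, 1),
    (0, 1, 1),
    (1, 0, 0),
    (1, 1, 1),
]
# Assumed action labels; in practice these must come from some policy.
actions = [0, 1, 0, 0]

print(smallest_preserving_subset(concept_matrix, actions))  # (0, 1)
```

The exhaustive search is exponential in the number of candidate concepts, so it is only a reference point for small examples.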

minor comments (1)

The abstract is concise but the full manuscript should include explicit pseudocode or algorithmic steps for DRS to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting two important issues in the abstract: the potential circularity in operationalizing decision-relevance and the lack of visible derivation for the performance bounds. We address both points below. Where the comments identify gaps in exposition or assumptions, we have revised the manuscript to improve clarity without altering the core claims or algorithms.

Referee: [Abstract] Abstract: the key insight equates decision-relevance with an abstraction in which identical concept vectors imply identical optimal actions. Operationalizing selection via partitioning states by candidate subsets and verifying action constancy within blocks requires access to (or estimation of) optimal actions/Q* for each state under varying projections. This presupposes a solution to the original MDP and introduces circularity for the motivating use case of learning policies when the optimal policy is unknown or expensive to compute.

Authors: We agree that a literal requirement for the exact optimal action function Q* would render the procedure circular for the primary use case of learning policies from scratch. The DRS algorithm is therefore instantiated with an approximate policy (or Q-function) obtained from a short initial training run or simulator rollouts; the selected concepts are then used to train or fine-tune the final policy. This two-stage procedure is already described in Section 3.2 and Algorithm 1, but we have added an explicit paragraph in the revised abstract and introduction clarifying the approximation, its error tolerance, and the conditions under which the recovered concepts remain decision-relevant. The theoretical analysis in Section 4 continues to use the exact optimal policy for the purpose of proving bounds, which we now flag as an idealized reference point rather than a prerequisite for practical deployment. revision: yes
Referee: [Abstract] Abstract (performance bounds): the bounds are asserted to relate selected concepts to policy performance, but the excerpt provides no derivation, assumptions, or error analysis. Without these details it is impossible to evaluate whether the bounds are non-vacuous or whether they apply only conditionally on already knowing the decision structure the method claims to recover.

Authors: The performance bounds appear in Theorem 4.1 and are derived from the standard state-abstraction regret analysis under the assumption that the selected concept mapping preserves the optimal action partition. The key assumptions (finite state space, known transition model or simulator, and Lipschitz continuity of the value function) are stated in Section 4.1; the proof shows that the sub-optimality gap is at most the diameter of the largest abstraction block times a discount-dependent constant. We have inserted a concise statement of the main assumptions and a one-sentence sketch of the bound into the revised abstract, while retaining the full derivation and error analysis in the main text. The bounds are therefore not conditional on already knowing the final decision structure; they quantify the worst-case loss relative to the optimal policy of the original MDP once the selected concepts are fixed. revision: yes
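
As a worked illustration of how a discount-dependent constant behaves, the block below plugs hypothetical numbers into the bound in the form quoted in the Lean-theorem section of this page. The values of ε(g_c) and γ are assumptions chosen for easy arithmetic and do not come from the paper.

```latex
% Bound in the form quoted on this page.
\[
  V^{\pi^*}(s) - V^{\pi^*_c}(s) \;\le\; \frac{2\,\epsilon(g_c)}{(1-\gamma)^2}
\]
% Hypothetical value: \epsilon(g_c) = 0.01.
\[
  \gamma = 0.9:\quad \frac{2 \times 0.01}{(1 - 0.9)^2} = \frac{0.02}{0.01} = 2,
  \qquad
  \gamma = 0.99:\quad \frac{2 \times 0.01}{(1 - 0.99)^2} = \frac{0.02}{0.0001} = 200.
\]
```

Under these assumed numbers, moving γ from 0.9 to 0.99 loosens the bound by a factor of 100, so whether the guarantee is non-vacuous depends on the reward scale and on how small ε(g_c) is.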

Circularity Check

0 steps flagged

No significant circularity; definition and bounds are self-contained via standard state abstraction without reducing to fitted inputs or self-citations.

The paper's core insight equates decision-relevance with state abstractions that preserve optimal action uniformity across concept-equivalent states, leading to DRS selection and performance bounds. No equations, fitted parameters, or self-citations are shown in the provided text that would force the result by construction. The approach imports the abstraction lens as an external perspective and claims empirical recovery of curated sets plus bounds relating selection quality to policy performance, remaining independent of the target policy itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the state-abstraction view of decision relevance; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption: A concept is decision-relevant if removing it would cause the agent to confuse states that require different actions; states with the same concept representation share the same optimal action.
Key insight stated in abstract that directly motivates DRS and the performance bounds.

pith-pipeline@v0.9.0 · 5495 in / 1202 out tokens · 51899 ms · 2026-05-10T20:17:43.116415+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

states with the same concept representation should share the same optimal action, which preserves the optimal decision structure of the original state space... $\epsilon_{Q^\pi}(g) := \max_{s, s' : g(s) = g(s')} D_{s,s'}$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

$V^{\pi^*}(s) - V^{\pi^*_c}(s) \le \frac{2\,\epsilon(g_c)}{(1-\gamma)^2}$
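
The two quoted passages can be tied together numerically. The sketch below computes the largest pairwise quantity D over states that an abstraction g maps together, then evaluates the right-hand side of the quoted bound. The excerpt does not define D or confirm that the two ε symbols coincide, so the distance table, the abstraction, and the reuse of one ε for both formulas are assumptions of this example.

```python
from itertools import combinations

def max_within_block_distance(abstraction, distance):
    """Largest distance D[s, s'] over pairs of states that the abstraction
    g maps to the same abstract state (0.0 if no two states collide)."""
    worst = 0.0
    for s, t in combinations(sorted(abstraction), 2):
        if abstraction[s] == abstraction[t]:
            worst = max(worst, distance[(s, t)])
    return worst

def value_loss_bound(epsilon, gamma):
    """Right-hand side of the quoted bound: 2 * epsilon / (1 - gamma)^2."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 2.0 * epsilon / (1.0 - gamma) ** 2

# Hypothetical abstraction g: states 0 and 1 collide, state 2 is alone.
abstraction = {0: "a", 1: "a", 2: "b"}
# Hypothetical symmetric distances, keyed by (smaller id, larger id).
distance = {(0, 1): 0.05, (0, 2): 0.90, (1, 2): 0.70}

eps = max_within_block_distance(abstraction, distance)
print(eps)                         # 0.05
print(value_loss_bound(eps, 0.5))  # 0.4
```

Only the colliding pair (0, 1) contributes to ε here; the large distances to state 2 are irrelevant because the abstraction keeps that state separate.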

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] CartPole originally has a state space consisting of four numbers: the position, velocity, angle, and angular velocity. We discretize each of these with two thresholds for position and angle, and four thresholds for angular velocity and velocity. This gives a total of 12 concepts.

[2] MiniGrid is characterized through the agent’s position, door location, key obtaining, and door unlock status. For MiniGrid, we use a fully symbolic state representation consisting of 12 discrete features extracted from the final observation frame. These features encode the agent’s (x, y) po...

[3] Pong has concepts derived from object-centric state variables computed from the last one or two frames of the observation stack. Base features include continuous-valued positions, velocities, and relative offsets of the agent paddle, ball, and opponent paddle. Specifically, we extract the agent paddle’s vertical position; the ball’s horizontal and vertic...

[4] Boxing concepts are constructed from the positions and movements of the player and opponent sprites. Base features include the normalized x- and y-positions of both the player and the enemy in the final observation frame, as well as their velocities computed as differences between the last two frames. We additionally include relative position features ca...

[5] Glucose has a state which consists of six continuous-valued physiological or control-related variables extracted from the final observation frame. Each feature corresponds to a scalar quantity, such as an absolute level, rate of change, or control signal, and no temporal differencing is applied. Concepts are defined by thresholding each feature at a h...
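
The CartPole entry above describes concepts built by thresholding four raw state variables. A minimal sketch of that construction follows, assuming one binary concept per threshold (which is consistent with the stated total of 2 + 2 + 4 + 4 = 12). The numeric cut points and the "greater than" convention are hypothetical; the source gives only the number of thresholds per variable.

```python
# Hypothetical thresholds: the source gives only the counts (2, 2, 4, 4),
# not the numeric cut points.
THRESHOLDS = {
    "position": [-1.0, 1.0],
    "angle": [-0.05, 0.05],
    "velocity": [-1.0, -0.25, 0.25, 1.0],
    "angular_velocity": [-1.0, -0.25, 0.25, 1.0],
}

def cartpole_concepts(position, velocity, angle, angular_velocity):
    """Map a raw CartPole state to binary concepts, one per threshold,
    where each concept answers "is this feature above the threshold?"."""
    raw = {
        "position": position,
        "velocity": velocity,
        "angle": angle,
        "angular_velocity": angular_velocity,
    }
    return {
        f"{name} > {cut}": int(raw[name] > cut)
        for name, cuts in THRESHOLDS.items()
        for cut in cuts
    }

concepts = cartpole_concepts(position=0.3, velocity=0.5, angle=-0.1, angular_velocity=0.0)
print(len(concepts))  # 12
```

A concept selection method would then pick a subset of these 12 binary features rather than operating on the raw continuous state.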
