Distributions as Actions: A Unified Framework for Diverse Action Spaces

A. Rupam Mahmood; Jiamin He; Martha White

arxiv: 2506.16608 · v3 · pith:MHNTPSXBnew · submitted 2025-06-19 · 💻 cs.LG · cs.AI

Jiamin He, A. Rupam Mahmood, Martha White

Pith reviewed 2026-05-19 08:24 UTC · model grok-4.3

keywords: reinforcement learning · policy gradient · action distributions · deterministic policies · actor-critic · unified framework · discrete continuous hybrid control

The pith

Treating parameterized action distributions as the actions themselves creates a continuous action space usable for any original action type in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that redefining the agent's choice as a distribution over actions rather than a single action makes the effective action space continuous, even when the underlying actions are discrete or hybrid. This shift supports a generalized deterministic policy gradient estimator called DA-PG that operates on distribution parameters and shows lower variance than gradients taken in the original action space. To make the critic work over these parameters, the paper introduces Interpolated Critic Learning as a practical fix. The resulting DA-AC algorithm, built on top of TD3, is tested across discrete, continuous, and hybrid control problems and reaches competitive performance. A sympathetic reader would care because one set of tools could then replace the current patchwork of methods needed for different action space types.

Core claim

By treating parameterized action distributions as actions, the boundary between agent and environment is redefined so that the new action space is continuous regardless of the original action type. Under this parameterization a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), is derived that exhibits lower variance than the gradient computed in the original action space. Although learning the critic over distribution parameters introduces new challenges, Interpolated Critic Learning addresses them, and the resulting Distributions-as-Actions Actor-Critic algorithm achieves competitive performance across discrete, continuous, and hybrid control.

What carries the argument

The Distributions-as-Actions reparameterization that shifts the policy output to the parameters of an action distribution, allowing deterministic gradients to be taken directly with respect to those parameters instead of raw actions.
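The boundary shift can be pictured as an environment wrapper. The sketch below is illustrative only: `ToyDiscreteEnv`, `DistributionAsActionEnv`, the reward values, and the softmax parameterization are hypothetical names and choices, not taken from the paper. It shows the agent emitting a probability vector while the sampling of a concrete discrete action happens on the environment side.

```python
import math
import random


def softmax(logits):
    """Map unconstrained logits to a probability vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyDiscreteEnv:
    """Hypothetical one-step environment with three discrete actions."""

    rewards = [0.0, 1.0, 0.2]  # made-up rewards, for illustration only

    def step(self, action):
        return self.rewards[action]


class DistributionAsActionEnv:
    """Wrapper whose action is a probability vector over the base actions.

    Sampling moves inside the environment boundary, so the agent only
    ever emits a point in the continuous probability simplex.
    """

    def __init__(self, base_env, rng):
        self.base_env = base_env
        self.rng = rng

    def step(self, probs):
        # Draw the concrete discrete action from the supplied distribution.
        action = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return self.base_env.step(action)


rng = random.Random(0)
env = DistributionAsActionEnv(ToyDiscreteEnv(), rng)
u = softmax([0.0, 2.0, -1.0])  # deterministic policy output: distribution parameters
reward = env.step(u)
```

From the agent's point of view the wrapped environment is stochastic and takes a continuous input, which is what allows a deterministic-policy-gradient style update to be applied.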

If this is right

The effective action space is continuous for discrete, continuous, and hybrid original actions.
DA-PG produces a lower-variance gradient estimate than the corresponding estimator in the original action space.
Interpolated Critic Learning enables stable critic training when the critic receives distribution parameters as input.
DA-AC built on TD3 reaches competitive returns on discrete, continuous, and hybrid control tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reparameterization could be applied to other actor-critic or policy-search methods without changing their core update rules.
Implementation of RL systems could become simpler because separate code paths for discrete versus continuous actions would no longer be required.
The variance reduction might become more pronounced in high-dimensional discrete action spaces where standard estimators suffer from high variance.
The framework could be tested on environments that mix continuous and discrete actions within the same time step to check whether the unified treatment removes the need for ad-hoc action masking.

Load-bearing premise

That Interpolated Critic Learning can adequately address the challenges of training a critic over the parameters of action distributions.

What would settle it

A direct comparison experiment in which the empirical variance of the DA-PG estimator exceeds that of the standard gradient estimator on the original action space, or in which DA-AC underperforms TD3 on standard continuous-control benchmarks.
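A toy version of such a variance comparison can be sketched in a three-armed bandit. Everything here is an assumption made for illustration: the action values `q` are invented and treated as an exact critic, the softmax parameterization is our choice, and the likelihood-ratio estimator stands in for "the standard gradient estimator on the original action space". With a learned critic the deterministic estimate would carry bias that this sketch does not model.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def lr_gradient_sample(theta, q, rng):
    """One likelihood-ratio sample: q(A) * grad_theta log pi(A), with A ~ pi."""
    pi = softmax(theta)
    a = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    return [q[a] * ((1.0 if i == a else 0.0) - pi[i]) for i in range(len(pi))]


def da_gradient(theta, q):
    """Deterministic gradient through u = softmax(theta), q_bar(u) = sum_a u_a q(a).

    grad_u q_bar = q, and the softmax Jacobian gives
    dJ/dtheta_i = pi_i * (q_i - sum_a pi_a q_a).
    """
    pi = softmax(theta)
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    return [p * (qa - expected_q) for p, qa in zip(pi, q)]


rng = random.Random(0)
theta = [0.0, 0.5, -0.5]
q = [0.0, 1.0, 0.2]  # hypothetical action values, assumed known exactly
n = 20000

samples = [lr_gradient_sample(theta, q, rng) for _ in range(n)]
lr_mean = [sum(s[i] for s in samples) / n for i in range(3)]
lr_var = [sum((s[i] - lr_mean[i]) ** 2 for s in samples) / (n - 1) for i in range(3)]
exact = da_gradient(theta, q)

print("likelihood-ratio mean    :", lr_mean)
print("likelihood-ratio variance:", lr_var)
print("deterministic gradient   :", exact)
```

In this setting both estimators target the same gradient, but the deterministic one has no sampling noise at all; the interesting empirical question is how much of that advantage survives once the critic is learned and the parameter vector is high-dimensional.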

Original abstract

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main move is to treat distribution parameters as the actions themselves so any original space becomes continuous, but the lower-variance claim for DA-PG rests on comparisons that do not yet isolate the reparameterization cleanly.

The colleague should know two things up front. First, the central idea is to make the action the vector of parameters for the policy distribution rather than a sample from it. This turns discrete or hybrid problems into continuous optimization over those parameters. Second, the authors build a policy gradient estimator called DA-PG on top of that and claim it has lower variance than the usual estimator in the original action space. They then add Interpolated Critic Learning to make the critic work in the new space and test a TD3-style algorithm called DA-AC.

Referee Report

2 major / 2 minor

Summary. The paper introduces a unified RL framework that treats parameterized distributions over actions as the agent's actions, thereby converting any original action space (discrete, continuous, or hybrid) into a continuous space of distribution parameters. Under this reparameterization the authors derive a generalized deterministic policy gradient estimator (DA-PG) that is claimed to exhibit lower variance than the estimator computed directly in the original action space. They further propose Interpolated Critic Learning (ICL) to stabilize critic training over the new parameter space and present the actor-critic algorithm DA-AC, which is built on TD3 and is reported to achieve competitive performance across discrete, continuous, and hybrid control benchmarks.

Significance. If the variance-reduction property of DA-PG can be shown to hold after controlling for dimensionality and critic-learning changes, and if the empirical gains are reproducible under matched conditions, the framework would offer a principled way to unify policy-gradient methods across heterogeneous action spaces. The explicit treatment of distribution parameters as actions also opens a route for applying continuous-control techniques to discrete and hybrid problems without ad-hoc discretization or relaxation.

major comments (2)

[Abstract and §3] Abstract and §3: The central claim that DA-PG possesses lower variance than the original-action-space gradient is load-bearing for the paper’s contribution, yet the manuscript provides neither a variance bound that accounts for the increased dimensionality of the distribution-parameter vector nor a controlled empirical comparison that isolates the reparameterization effect from the simultaneous introduction of ICL and changes in critic architecture.
[§5] §5 (Experiments): The reported competitive performance of DA-AC versus TD3 and other baselines would be strengthened by an ablation that keeps the critic network, replay buffer, and hyper-parameters fixed while toggling only the action representation (original vs. distribution parameters) and the use of ICL; without such isolation it remains unclear whether observed improvements stem from the DA-PG estimator itself.

minor comments (2)

[§2] Notation for the distribution-parameter vector and the mapping from parameters to action distributions should be introduced once in §2 and used consistently thereafter to avoid ambiguity when comparing gradients in the two spaces.
[§5] Figure captions and axis labels in the experimental section would benefit from explicit mention of the number of independent seeds and whether shaded regions represent standard error or standard deviation.
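The distinction raised in the last minor comment changes how wide the shaded regions look. A minimal sketch, using made-up returns for five hypothetical seeds (none of these numbers come from the paper):

```python
import math
import statistics

# Hypothetical final returns from five independent seeds (illustrative only).
returns = [210.0, 195.0, 230.0, 205.0, 220.0]

mean = statistics.mean(returns)
std_dev = statistics.stdev(returns)          # sample standard deviation (n - 1 denominator)
std_err = std_dev / math.sqrt(len(returns))  # standard error of the mean

print(f"mean={mean:.1f}  std_dev={std_dev:.2f}  std_err={std_err:.2f}")
```

With five seeds the standard-error band is narrower than the standard-deviation band by a factor of sqrt(5), so a caption that omits which one is plotted leaves the apparent separation between curves ambiguous.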

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the basis for our variance claims and committing to additional experiments that better isolate the effects of our reparameterization.

Referee: [Abstract and §3] Abstract and §3: The central claim that DA-PG possesses lower variance than the original-action-space gradient is load-bearing for the paper’s contribution, yet the manuscript provides neither a variance bound that accounts for the increased dimensionality of the distribution-parameter vector nor a controlled empirical comparison that isolates the reparameterization effect from the simultaneous introduction of ICL and changes in critic architecture.

Authors: The DA-PG estimator applies the deterministic policy gradient theorem in the continuous space of distribution parameters, thereby removing the additional Monte-Carlo sampling variance that arises when gradients are taken directly over discrete or hybrid actions. While a closed-form variance bound that explicitly incorporates the higher dimensionality of the parameter vector is not derived in the current manuscript, the reparameterization converts stochastic policy gradients into deterministic ones, which is the source of the variance reduction we claim. To isolate the reparameterization effect from ICL and critic changes, we will add a controlled ablation in the revision that re-uses the identical critic architecture and training procedure for both the original action space and the distribution-parameter space.

revision: partial

Referee: [§5] §5 (Experiments): The reported competitive performance of DA-AC versus TD3 and other baselines would be strengthened by an ablation that keeps the critic network, replay buffer, and hyper-parameters fixed while toggling only the action representation (original vs. distribution parameters) and the use of ICL; without such isolation it remains unclear whether observed improvements stem from the DA-PG estimator itself.

Authors: We agree that a more tightly controlled ablation would strengthen the empirical claims. In the revised manuscript we will report an additional experiment in which the critic network architecture, replay buffer, optimizer settings, and all other hyperparameters are held exactly fixed while we toggle only the action representation (original actions versus distribution parameters) and the presence or absence of ICL.

revision: yes

standing simulated objections not resolved

A formal variance bound for DA-PG that rigorously accounts for the increased dimensionality of the distribution-parameter vector

Circularity Check

0 steps flagged

No significant circularity in DA-PG derivation or framework

The paper reparameterizes actions as distributions to create a continuous action space and applies the standard deterministic policy gradient theorem within that space to obtain DA-PG. This is a direct methodological extension rather than a result that reduces to its own inputs by construction. The lower-variance claim is asserted as a property of the reparameterized estimator; no equation or self-citation is shown that forces the variance reduction tautologically. ICL is introduced as an auxiliary technique with bandit insights, and the algorithm builds on the external TD3 baseline without load-bearing self-citations or fitted inputs renamed as predictions. The derivation chain remains independent of the target performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identifiable; the framework implicitly assumes that distribution parameters can serve as a sufficient statistic for policy improvement and that ICL stabilizes critic learning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We introduce the parameters-as-actions framework... the agent outputs distribution parameters ¯πθ(s) as its action... DPPG estimator... lower variance than the gradient in the original action space.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

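Theorem 4.2 (Distribution parameter policy gradient theorem)... $\nabla_\theta J(\bar{\pi}_\theta) = \mathbb{E}_{s \sim d_{\bar{\pi}_\theta}}\left[ \nabla_\theta \bar{\pi}_\theta(s) \, \nabla_u \bar{q}_{\bar{\pi}_\theta}(s, u) \big|_{u = \bar{\pi}_\theta(s)} \right]$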
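The chain-rule structure of the quoted gradient can be checked numerically in a single-state example. The setup below is an assumption chosen for illustration, not the paper's: actions are drawn from a Gaussian with parameters u = (mu, sigma), the reward is r(a) = -(a - 1)^2, and the policy maps theta to u by (theta0, exp(theta1)). With one state the expectation over states is trivial, so the gradient reduces to the Jacobian of the policy applied to the gradient of the critic in u.

```python
import math


def policy(theta):
    """Deterministic map from theta to distribution parameters u = (mu, sigma)."""
    return [theta[0], math.exp(theta[1])]


def q_bar(u):
    """Expected reward of A ~ N(mu, sigma^2) under r(a) = -(a - 1)^2 (closed form)."""
    mu, sigma = u
    return -((mu - 1.0) ** 2 + sigma ** 2)


def grad_u_q_bar(u):
    """Gradient of the critic with respect to the distribution parameters."""
    mu, sigma = u
    return [-2.0 * (mu - 1.0), -2.0 * sigma]


def chain_rule_gradient(theta):
    """grad_theta J = (du/dtheta)^T grad_u q_bar, evaluated at u = policy(theta)."""
    u = policy(theta)
    g = grad_u_q_bar(u)
    # The policy Jacobian is diagonal here: dmu/dtheta0 = 1, dsigma/dtheta1 = sigma.
    return [1.0 * g[0], u[1] * g[1]]


def finite_difference_gradient(theta, eps=1e-6):
    """Central differences on J(theta) = q_bar(policy(theta))."""
    grads = []
    for i in range(len(theta)):
        hi = list(theta)
        lo = list(theta)
        hi[i] += eps
        lo[i] -= eps
        grads.append((q_bar(policy(hi)) - q_bar(policy(lo))) / (2 * eps))
    return grads


theta = [0.3, -0.7]
analytic = chain_rule_gradient(theta)
numeric = finite_difference_gradient(theta)
print("chain rule        :", analytic)
print("finite differences:", numeric)
```

The two gradients agree to numerical precision, and no action is ever sampled: the randomness has been absorbed into the closed-form critic over distribution parameters.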

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.