Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Dongyan (Lucy) Huo; Qiang Chen; Rushuai Yang; Yi Chen

arxiv: 2606.10979 · v1 · pith:G66LDEMEnew · submitted 2026-06-09 · 💻 cs.AI

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Yi Chen , Rushuai Yang , Qiang Chen , Dongyan (Lucy) Huo This is my paper

Pith reviewed 2026-06-27 13:35 UTC · model grok-4.3

classification 💻 cs.AI

keywords Bellman-Taylor score decodingMarkov decision processesstate-dependent feasible actionsdeep reinforcement learningaction decoderqueueing network controloptimality gap decomposition

0 comments

The pith

Bellman-Taylor score decoding moves policy learning to an unconstrained Euclidean space for MDPs whose feasible actions depend on state and implicit constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bellman-Taylor score decoding to address MDPs in which feasible actions vary with state and are defined implicitly by operational constraints. Policy optimization occurs inside a latent Euclidean score space while a separate decoder maps scores to feasible actions. The construction is motivated by a Taylor expansion of the optimal action-value function and permits use of standard deep reinforcement learning algorithms without back-propagation through the decoder. A performance guarantee decomposes the optimality gap into a structural approximation error arising from the expansion and an algorithmic learning error. The method is illustrated on a queueing-network control task where the learned policy recovers a state-dependent index-based dispatching rule.

Core claim

Motivated by a Taylor expansion of the optimal action-value function, Bellman-Taylor score decoding relocates policy learning into an unconstrained Euclidean score space for MDPs that possess state-dependent feasible action sets. An action decoder separately enforces feasibility, so that the induced latent-score MDP can be optimized by any standard DRL algorithm without differentiation through the decoder. The resulting optimality gap decomposes into a structural approximation error and an algorithmic learning error.

What carries the argument

Bellman-Taylor score decoding, which uses a Taylor expansion of the optimal action-value function to justify a latent-score MDP whose policies are decoded to feasible actions by a separate module.

If this is right

The optimality gap decomposes into a structural approximation error and an algorithmic learning error.
The induced latent-score MDP can be optimized by standard DRL algorithms without differentiating through the decoder.
In a queueing network control problem the learned policy recovers a state-dependent index-based dispatching rule.
Numerical experiments show near-optimal performance on small instances and considerable improvements over benchmarks on larger systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same score-decoder split could be applied to other operations-research MDPs whose constraints are implicit rather than explicit.
Higher-order terms in the Taylor expansion might be substituted for the linear term to shrink the structural error component.
The framework suggests a route for transferring off-the-shelf DRL libraries to constrained sequential decision problems without custom action-space engineering.

Load-bearing premise

A Taylor expansion of the optimal action-value function yields a sufficiently accurate structural approximation to justify moving policy optimization into an unconstrained Euclidean score space while a separate decoder enforces feasibility.

What would settle it

On a small MDP whose optimal value function is known exactly, compute the Taylor approximation error directly and the achieved optimality gap; if the gap is not equal to the sum of that structural error and a separately measured learning error, the claimed decomposition fails.

read the original abstract

Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman--Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP then can be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical way to run standard DRL on MDPs with state-dependent implicit action constraints by shifting optimization to a Taylor-derived score space, but the decomposed optimality guarantee rests on an approximation whose error bound is not shown to be independent of the decoder.

read the letter

The main thing here is a construction that takes the optimal Q* , does a first-order Taylor step around some reference action, and treats the resulting linear term as a score that lives in Euclidean space. A separate decoder then maps any score back to a feasible action for the original MDP. Standard RL runs on the induced latent-score process without gradients through the decoder. They apply it to a queueing network and report that the learned policy behaves like a state-dependent index rule, with good numerical results on small cases and better performance than the benchmarks on bigger ones.

What works is the targeting of a genuine interface problem: many OR MDPs have feasible sets that change with state and are defined by constraints rather than a fixed catalog. Moving the learning into an unconstrained score space while keeping feasibility outside the optimizer is a clean separation. The queueing example is a reasonable test bed and the claim that it recovers index policies is plausible.

The soft spot is exactly the one flagged in the stress-test note. The performance guarantee splits the optimality gap into structural approximation error from the Taylor step plus learning error. For that split to be useful, the remainder term has to stay small and controllable regardless of how the decoder behaves or how complex the feasible sets get. The abstract motivates the expansion but gives no Lipschitz constants, no explicit remainder bound, and no argument that the decoder mapping preserves the approximation quality. Without those pieces the structural term could grow with state dimension or action-set size, which would make the decomposition less informative than claimed.

This is aimed at people who already work on constrained or structured RL and want to plug standard libraries into OR-style problems. It is worth sending to referees because the practical motivation is solid and the experiments exist; the theory section will need the missing bounds and assumptions spelled out before the guarantee can be taken at face value.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Bellman-Taylor score decoding for MDPs whose feasible action sets are state-dependent and implicitly defined by operational constraints. Motivated by a first-order Taylor expansion of the optimal action-value function Q*, the framework shifts policy optimization to an unconstrained Euclidean latent-score space while a separate decoder maps scores to feasible actions; the resulting latent-score MDP can be solved by standard DRL algorithms without back-propagating through the decoder. A performance guarantee is stated that decomposes the optimality gap into a structural approximation error (from the Taylor expansion) plus an algorithmic learning error. The method is instantiated on a queueing-network control problem, where the learned policy reduces to a state-dependent index-based dispatching rule, and numerical experiments report near-optimal performance on small instances and improvements over benchmarks on larger systems.

Significance. If the decomposed optimality-gap bound can be made rigorous with a controllable structural term, the framework would allow off-the-shelf DRL algorithms to be applied directly to a large class of constrained MDPs that arise in operations research, without requiring custom action-space representations. The queueing-network application supplies a concrete, falsifiable demonstration that the latent-score construction can recover interpretable dispatching policies.

major comments (1)

[§4] §4 (Performance Guarantee, Theorem on decomposed optimality gap): the structural approximation error is asserted to arise from a first-order Taylor expansion of Q*, yet no explicit remainder bound, no Lipschitz or C^2 assumptions on Q* that would control the second-order term uniformly across state-dependent feasible sets, and no argument that the decoder mapping preserves the approximation quality are supplied. Without such a bound independent of the decoder and policy, the claimed decomposition does not establish that the structural term remains small enough to justify near-optimality of the overall procedure.

minor comments (2)

[§3] Notation for the latent-score MDP and the decoder mapping is introduced without a compact summary table relating the original MDP components (S, A(s), P, r) to their latent counterparts; this makes it difficult to track which quantities are preserved exactly and which are approximated.
[§5] The experimental section reports “near-optimal performance” on small instances but does not state the precise optimality gap metric, the number of random seeds, or confidence intervals; adding these would strengthen reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need to strengthen the rigor of the performance guarantee. We address the concern point by point below.

read point-by-point responses

Referee: [§4] §4 (Performance Guarantee, Theorem on decomposed optimality gap): the structural approximation error is asserted to arise from a first-order Taylor expansion of Q*, yet no explicit remainder bound, no Lipschitz or C^2 assumptions on Q* that would control the second-order term uniformly across state-dependent feasible sets, and no argument that the decoder mapping preserves the approximation quality are supplied. Without such a bound independent of the decoder and policy, the claimed decomposition does not establish that the structural term remains small enough to justify near-optimality of the overall procedure.

Authors: We agree that the current formulation of the theorem in §4 states the decomposition into structural and algorithmic errors but does not supply the requested regularity conditions or remainder control. In the revision we will add the following: (i) a uniform C² assumption on Q* together with a Lipschitz condition on the decoder; (ii) an explicit second-order Taylor remainder bound that is uniform across the family of state-dependent feasible sets; and (iii) a short lemma showing that the decoder mapping perturbs the first-order approximation by an additive term controlled by the decoder Lipschitz constant. These changes will make the structural error term explicitly bounded and independent of the learned policy, thereby justifying the claimed near-optimality under the stated assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation derives guarantee from Taylor-motivated construction without reducing to inputs by definition or self-citation.

full rationale

Abstract and description present a framework motivated by first-order Taylor expansion of Q*, inducing a latent-score MDP optimized by standard DRL, with an optimality-gap decomposition into structural approximation error plus learning error. No equations or claims are quoted that define the structural term via the decoder output, fit parameters on evaluation data then relabel as prediction, or rely on self-citations for uniqueness or ansatz. The decomposition is presented as a derived bound rather than tautological; the paper is self-contained against external benchmarks with no load-bearing self-reference visible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review supplies no explicit free parameters, background axioms, or invented entities beyond the high-level concepts of score space and decoder; the Taylor expansion is invoked but its order and assumptions are unspecified.

invented entities (1)

latent-score MDP no independent evidence
purpose: Decouples policy learning from the feasibility decoder so that standard DRL algorithms can be applied directly
Introduced as the intermediate MDP in which optimization occurs; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5705 in / 1112 out tokens · 21313 ms · 2026-06-27T13:35:24.808229+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages · 4 internal anchors

[1]

Proceedings of Robotics: Science and Systems (RSS) , year=

Aravind Rajeswaran and Vikash Kumar and Abhishek Gupta and Giulia Vezzani and John Schulman and Emanuel Todorov and Sergey Levine , title=. Proceedings of Robotics: Science and Systems (RSS) , year=
[2]

2014 , publisher=

Markov decision processes: discrete stochastic dynamic programming , author=. 2014 , publisher=

2014
[3]

Advances in neural information processing systems , volume=

Neural trust region/proximal policy optimization attains globally optimal policy , author=. Advances in neural information processing systems , volume=
[4]

Nature , volume=

Outracing champion Gran Turismo drivers with deep reinforcement learning , author=. Nature , volume=. 2022 , publisher=

2022
[5]

Playing Atari with Deep Reinforcement Learning

Playing atari with deep reinforcement learning , author=. arXiv preprint arXiv:1312.5602 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

International conference on machine learning , pages=

Constrained policy optimization , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[7]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

2018
[8]

Queueing Systems , volume=

A survey on skill-based routing with applications to service operations management , author=. Queueing Systems , volume=. 2020 , publisher=

2020
[9]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Deep reinforcement learning for robotics: A survey of real-world successes , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[10]

1998 , publisher=

Reinforcement learning: An introduction , author=. 1998 , publisher=

1998
[11]

1957 , publisher=

Dynamic Programming , author=. 1957 , publisher=

1957
[12]

Inpatient Overflow Management with Proximal Policy Optimization

Inpatient overflow management with proximal policy optimization , author=. arXiv preprint arXiv:2410.13767 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Encyclopedia of optimization , pages =

Neuro-dynamic programming , author =. Encyclopedia of optimization , pages =. 2025 , publisher =

2025
[14]

2011 , publisher=

Approximate Dynamic Programming: Solving the Curses of Dimensionality , author=. 2011 , publisher=

2011
[15]

Nature , volume=

Human-level control through deep reinforcement learning , author=. Nature , volume=. 2015 , publisher=

2015
[16]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Machine Learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine Learning , volume=. 1992 , publisher=

1992
[18]

arXiv preprint arXiv:2006.14171 , year=

A closer look at invalid action masking in policy gradient algorithms , author=. arXiv preprint arXiv:2006.14171 , year=

work page arXiv 2006
[19]

Journal of Machine Learning Research , volume=

Variance reduction techniques for gradient estimates in reinforcement learning , author=. Journal of Machine Learning Research , volume=
[20]

and Varaiya, P

Buyukkoc, C. and Varaiya, P. and Walrand, J. , title =. Advances in Applied Probability , volume =. 1985 , doi =

1985
[21]

29th IEEE Conference on Decision and Control , pages=

Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks , author=. 29th IEEE Conference on Decision and Control , pages=. 1990 , organization=

1990
[22]

The Annals of Applied Probability , volume=

Maxweight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic , author=. The Annals of Applied Probability , volume=. 2004 , publisher=

2004
[23]

Manufacturing & Service Operations Management , volume=

Inpatient overflow: An approximate dynamic programming approach , author=. Manufacturing & Service Operations Management , volume=. 2019 , publisher=

2019
[24]

arXiv preprint arXiv:2501.10523 , year=

Multiclass queue scheduling under slowdown: An approximate dynamic programming approach , author=. arXiv preprint arXiv:2501.10523 , year=

work page arXiv
[25]

, title =

Chandak, Yash and Theocharous, Georgios and Kostas, James and Jordan, Scott and Thomas, Philip S. , title =. International Conference on Machine Learning (ICML) , year =
[26]

Deep Reinforcement Learning in Large Discrete Action Spaces

Dulac-Arnold, Gabriel and Evans, Richard and van Hasselt, Hado and Sunehag, Peter and Lillicrap, Timothy and Hunt, Jonathan and Mann, Timothy and Weber, Theophane and Degris, Thomas and Coppin, Ben , title =. arXiv preprint arXiv:1512.07679 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[27]

AAAI Conference on Artificial Intelligence , year =

Tavakoli, Arash and Pardo, Fabio and Kormushev, Petar , title =. AAAI Conference on Artificial Intelligence , year =
[28]

International Conference on Learning Representations (ICLR) , year =

Hausknecht, Matthew and Stone, Peter , title =. International Conference on Learning Representations (ICLR) , year =
[29]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Li, Boyan and Tang, Hongyao and Zheng, Yan and Hao, Jianye and Li, Pengyi and Wang, Zhen and Meng, Zhaopeng and Wang, Li , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[30]

A Closer Look at Invalid Action Masking in Policy Gradient Algorithms , booktitle =

Huang, Shengyi and Onta. A Closer Look at Invalid Action Masking in Policy Gradient Algorithms , booktitle =. 2022 , doi =

2022
[31]

and Mannor, Shie , title =

Zahavy, Tom and Haroush, Matan and Merlis, Nadav and Mankowitz, Daniel J. and Mannor, Shie , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[32]

IEEE International Conference on Robotics and Automation (ICRA) , year =

Pham, Tu-Hoa and De Magistris, Giovanni and Tachibana, Ryuki , title =. IEEE International Conference on Robotics and Automation (ICRA) , year =
[33]

Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) , year =

Ling, Jiajing and Schuler, Moritz Lukas and Kumar, Akshat and Varakantham, Pradeep , title =. Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) , year =
[34]

Structured Reinforcement Learning for Combinatorial Decision-Making , booktitle =

Hoppe, Heiko and Baty, L. Structured Reinforcement Learning for Combinatorial Decision-Making , booktitle =. 2025 , url =

2025
[35]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Delarue, Arthur and Anderson, Ross and Tjandraatmadja, Christian , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[36]

Manufacturing & Service Operations Management , volume =

Harsha, Pavithra and Jagmohan, Ashish and Kalagnanam, Jayant and Quanz, Brian and Singhvi, Divya , title =. Manufacturing & Service Operations Management , volume =. 2025 , doi =

2025
[37]

International Conference on Learning Representations (ICLR) , year =

Xu, Lily and Wilder, Bryan and Khalil, Elias Boutros and Tambe, Milind , title =. International Conference on Learning Representations (ICLR) , year =
[38]

Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning , booktitle =

Tkachuk, Volodymyr and Bakhtiari, Seyed Alireza and Kirschner, Johannes and Jusup, Matej and Bogunovic, Ilija and Szepesv. Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning , booktitle =. 2023 , pages =

2023
[39]

and Grigas, Paul , title =

Elmachtoub, Adam N. and Grigas, Paul , title =. Management Science , volume =. 2022 , doi =

2022
[40]

Journal of Artificial Intelligence Research , volume =

Mandi, Jayanta and Kotary, James and Berden, Senne and Mulamba, Maxime and Bucarey, Victor and Guns, Tias and Fioretto, Ferdinando , title =. Journal of Artificial Intelligence Research , volume =. 2024 , doi =

2024
[41]

Zico , title =

Amos, Brandon and Kolter, J. Zico , title =. International Conference on Machine Learning (ICML) , year =
[42]

Zico , title =

Agrawal, Akshay and Amos, Brandon and Barratt, Shane and Boyd, Stephen and Diamond, Steven and Kolter, J. Zico , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[43]

AAAI Conference on Artificial Intelligence , year =

Ferber, Aaron and Wilder, Bryan and Dilkina, Bistra and Tambe, Milind , title =. AAAI Conference on Artificial Intelligence , year =
[44]

Differentiation of Blackbox Combinatorial Solvers , booktitle =

Vlastelica Pogan. Differentiation of Blackbox Combinatorial Solvers , booktitle =
[45]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Berthet, Quentin and Blondel, Mathieu and Teboul, Olivier and Cuturi, Marco and Vert, Jean-Philippe and Bach, Francis , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =
[46]

Stochastic Systems , volume=

Queueing network controls via deep reinforcement learning , author=. Stochastic Systems , volume=. 2022 , publisher=

2022
[47]

Management Science , volume=

A primal-dual approach to constrained markov decision processes with applications to queue scheduling and inventory management , author=. Management Science , volume=. 2026 , publisher=

2026
[48]

Manufacturing & Service Operations Management , volume=

Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems , author=. Manufacturing & Service Operations Management , volume=. 2022 , publisher=

2022

[1] [1]

Proceedings of Robotics: Science and Systems (RSS) , year=

Aravind Rajeswaran and Vikash Kumar and Abhishek Gupta and Giulia Vezzani and John Schulman and Emanuel Todorov and Sergey Levine , title=. Proceedings of Robotics: Science and Systems (RSS) , year=

[2] [2]

2014 , publisher=

Markov decision processes: discrete stochastic dynamic programming , author=. 2014 , publisher=

2014

[3] [3]

Advances in neural information processing systems , volume=

Neural trust region/proximal policy optimization attains globally optimal policy , author=. Advances in neural information processing systems , volume=

[4] [4]

Nature , volume=

Outracing champion Gran Turismo drivers with deep reinforcement learning , author=. Nature , volume=. 2022 , publisher=

2022

[5] [5]

Playing Atari with Deep Reinforcement Learning

Playing atari with deep reinforcement learning , author=. arXiv preprint arXiv:1312.5602 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

International conference on machine learning , pages=

Constrained policy optimization , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[7] [7]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

2018

[8] [8]

Queueing Systems , volume=

A survey on skill-based routing with applications to service operations management , author=. Queueing Systems , volume=. 2020 , publisher=

2020

[9] [9]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Deep reinforcement learning for robotics: A survey of real-world successes , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[10] [10]

1998 , publisher=

Reinforcement learning: An introduction , author=. 1998 , publisher=

1998

[11] [11]

1957 , publisher=

Dynamic Programming , author=. 1957 , publisher=

1957

[12] [12]

Inpatient Overflow Management with Proximal Policy Optimization

Inpatient overflow management with proximal policy optimization , author=. arXiv preprint arXiv:2410.13767 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Encyclopedia of optimization , pages =

Neuro-dynamic programming , author =. Encyclopedia of optimization , pages =. 2025 , publisher =

2025

[14] [14]

2011 , publisher=

Approximate Dynamic Programming: Solving the Curses of Dimensionality , author=. 2011 , publisher=

2011

[15] [15]

Nature , volume=

Human-level control through deep reinforcement learning , author=. Nature , volume=. 2015 , publisher=

2015

[16] [16]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Machine Learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine Learning , volume=. 1992 , publisher=

1992

[18] [18]

arXiv preprint arXiv:2006.14171 , year=

A closer look at invalid action masking in policy gradient algorithms , author=. arXiv preprint arXiv:2006.14171 , year=

work page arXiv 2006

[19] [19]

Journal of Machine Learning Research , volume=

Variance reduction techniques for gradient estimates in reinforcement learning , author=. Journal of Machine Learning Research , volume=

[20] [20]

and Varaiya, P

Buyukkoc, C. and Varaiya, P. and Walrand, J. , title =. Advances in Applied Probability , volume =. 1985 , doi =

1985

[21] [21]

29th IEEE Conference on Decision and Control , pages=

Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks , author=. 29th IEEE Conference on Decision and Control , pages=. 1990 , organization=

1990

[22] [22]

The Annals of Applied Probability , volume=

Maxweight scheduling in a generalized switch: State space collapse and workload minimization in heavy traffic , author=. The Annals of Applied Probability , volume=. 2004 , publisher=

2004

[23] [23]

Manufacturing & Service Operations Management , volume=

Inpatient overflow: An approximate dynamic programming approach , author=. Manufacturing & Service Operations Management , volume=. 2019 , publisher=

2019

[24] [24]

arXiv preprint arXiv:2501.10523 , year=

Multiclass queue scheduling under slowdown: An approximate dynamic programming approach , author=. arXiv preprint arXiv:2501.10523 , year=

work page arXiv

[25] [25]

, title =

Chandak, Yash and Theocharous, Georgios and Kostas, James and Jordan, Scott and Thomas, Philip S. , title =. International Conference on Machine Learning (ICML) , year =

[26] [26]

Deep Reinforcement Learning in Large Discrete Action Spaces

Dulac-Arnold, Gabriel and Evans, Richard and van Hasselt, Hado and Sunehag, Peter and Lillicrap, Timothy and Hunt, Jonathan and Mann, Timothy and Weber, Theophane and Degris, Thomas and Coppin, Ben , title =. arXiv preprint arXiv:1512.07679 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

AAAI Conference on Artificial Intelligence , year =

Tavakoli, Arash and Pardo, Fabio and Kormushev, Petar , title =. AAAI Conference on Artificial Intelligence , year =

[28] [28]

International Conference on Learning Representations (ICLR) , year =

Hausknecht, Matthew and Stone, Peter , title =. International Conference on Learning Representations (ICLR) , year =

[29] [29]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Li, Boyan and Tang, Hongyao and Zheng, Yan and Hao, Jianye and Li, Pengyi and Wang, Zhen and Meng, Zhaopeng and Wang, Li , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[30] [30]

A Closer Look at Invalid Action Masking in Policy Gradient Algorithms , booktitle =

Huang, Shengyi and Onta. A Closer Look at Invalid Action Masking in Policy Gradient Algorithms , booktitle =. 2022 , doi =

2022

[31] [31]

and Mannor, Shie , title =

Zahavy, Tom and Haroush, Matan and Merlis, Nadav and Mankowitz, Daniel J. and Mannor, Shie , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[32] [32]

IEEE International Conference on Robotics and Automation (ICRA) , year =

Pham, Tu-Hoa and De Magistris, Giovanni and Tachibana, Ryuki , title =. IEEE International Conference on Robotics and Automation (ICRA) , year =

[33] [33]

Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) , year =

Ling, Jiajing and Schuler, Moritz Lukas and Kumar, Akshat and Varakantham, Pradeep , title =. Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) , year =

[34] [34]

Structured Reinforcement Learning for Combinatorial Decision-Making , booktitle =

Hoppe, Heiko and Baty, L. Structured Reinforcement Learning for Combinatorial Decision-Making , booktitle =. 2025 , url =

2025

[35] [35]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Delarue, Arthur and Anderson, Ross and Tjandraatmadja, Christian , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[36] [36]

Manufacturing & Service Operations Management , volume =

Harsha, Pavithra and Jagmohan, Ashish and Kalagnanam, Jayant and Quanz, Brian and Singhvi, Divya , title =. Manufacturing & Service Operations Management , volume =. 2025 , doi =

2025

[37] [37]

International Conference on Learning Representations (ICLR) , year =

Xu, Lily and Wilder, Bryan and Khalil, Elias Boutros and Tambe, Milind , title =. International Conference on Learning Representations (ICLR) , year =

[38] [38]

Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning , booktitle =

Tkachuk, Volodymyr and Bakhtiari, Seyed Alireza and Kirschner, Johannes and Jusup, Matej and Bogunovic, Ilija and Szepesv. Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning , booktitle =. 2023 , pages =

2023

[39] [39]

and Grigas, Paul , title =

Elmachtoub, Adam N. and Grigas, Paul , title =. Management Science , volume =. 2022 , doi =

2022

[40] [40]

Journal of Artificial Intelligence Research , volume =

Mandi, Jayanta and Kotary, James and Berden, Senne and Mulamba, Maxime and Bucarey, Victor and Guns, Tias and Fioretto, Ferdinando , title =. Journal of Artificial Intelligence Research , volume =. 2024 , doi =

2024

[41] [41]

Zico , title =

Amos, Brandon and Kolter, J. Zico , title =. International Conference on Machine Learning (ICML) , year =

[42] [42]

Zico , title =

Agrawal, Akshay and Amos, Brandon and Barratt, Shane and Boyd, Stephen and Diamond, Steven and Kolter, J. Zico , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[43] [43]

AAAI Conference on Artificial Intelligence , year =

Ferber, Aaron and Wilder, Bryan and Dilkina, Bistra and Tambe, Milind , title =. AAAI Conference on Artificial Intelligence , year =

[44] [44]

Differentiation of Blackbox Combinatorial Solvers , booktitle =

Vlastelica Pogan. Differentiation of Blackbox Combinatorial Solvers , booktitle =

[45] [45]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Berthet, Quentin and Blondel, Mathieu and Teboul, Olivier and Cuturi, Marco and Vert, Jean-Philippe and Bach, Francis , title =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[46] [46]

Stochastic Systems , volume=

Queueing network controls via deep reinforcement learning , author=. Stochastic Systems , volume=. 2022 , publisher=

2022

[47] [47]

Management Science , volume=

A primal-dual approach to constrained markov decision processes with applications to queue scheduling and inventory management , author=. Management Science , volume=. 2026 , publisher=

2026

[48] [48]

Manufacturing & Service Operations Management , volume=

Can deep reinforcement learning improve inventory management? Performance on lost sales, dual-sourcing, and multi-echelon problems , author=. Manufacturing & Service Operations Management , volume=. 2022 , publisher=

2022