Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management
Pith reviewed 2026-05-24 00:33 UTC · model grok-4.3
The pith
A continuous-time reinforcement learning framework for intensity control adapts discrete-time algorithms using event-driven state jumps without pre-discretizing time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging the event-driven structure of intensity control and the discretization that state-jump times induce, discrete-time RL algorithms can be adapted to continuous time without discretizing the time horizon in advance. The resulting policy-evaluation and actor-critic methods scale well and outperform benchmarks, especially in non-stationary settings.
What carries the argument
The event-driven structure of the problem creates an inherent discretization via state-jump times, which allows discrete-time algorithms to be adapted to continuous time.
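To make the idea of jump times as a natural discretization concrete, here is a minimal sketch, not taken from the paper. It samples the event times of a non-homogeneous Poisson process by thinning, so the only time points that ever appear are the random jump times themselves and no fixed grid is chosen in advance. The function name `simulate_jump_times`, the sinusoidal rate, and the bound `rate_max` are illustrative assumptions.

```python
import math
import random


def simulate_jump_times(rate_fn, rate_max, horizon, rng):
    """Sample event times on [0, horizon] by thinning.

    Assumes 0 <= rate_fn(t) <= rate_max for all t in [0, horizon].
    """
    t = 0.0
    jumps = []
    while True:
        # Candidate event from a dominating homogeneous Poisson process.
        t += rng.expovariate(rate_max)
        if t > horizon:
            return jumps
        # Accept the candidate with probability rate_fn(t) / rate_max.
        if rng.random() * rate_max <= rate_fn(t):
            jumps.append(t)


# Hypothetical non-stationary arrival rate, bounded above by 3.0.
rng = random.Random(0)
jumps = simulate_jump_times(lambda t: 2.0 + math.sin(t), 3.0, 10.0, rng)
print(len(jumps), "state jumps; the state is constant between them")
```

Between two consecutive entries of `jumps` the state does not change, so a learning update can be attached to each jump rather than to each step of a pre-specified time grid.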
If this is right
- The continuous-time approach delivers significantly superior performance compared to discretization-based reinforcement learning methods while maintaining comparable computational efficiency.
- This advantage is particularly pronounced in highly non-stationary environments.
- The method effectively scales to large-scale problems.
- Actor-critic algorithms based on policy gradients can be developed for event-driven intensity control.
Where Pith is reading between the lines
- Similar event-driven adaptations could apply to other continuous-time stochastic control problems such as queueing systems.
- The framework might simplify implementation for practitioners who already use discrete RL tools on continuous processes.
- Testing on even larger instances or real data could further validate the scalability claims.
Load-bearing premise
The event-driven structure of intensity control problems permits direct adaptation of discrete-time Monte Carlo, temporal difference, and policy-gradient algorithms to continuous time without requiring advance discretization of the time horizon.
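As a hedged illustration of what event-driven Monte Carlo policy evaluation can look like, the sketch below evaluates a fixed policy on a toy single-resource problem. Everything in it is an assumption made for illustration (one product, a constant sales rate, a fixed price, the names `mc_policy_value` and `exact_value`); it is not the paper's estimator or its choice model. Each simulated path advances from one sale to the next, and the result is checked against a closed-form value.

```python
import math
import random


def mc_policy_value(rate, price, capacity, horizon, n_paths, seed=0):
    """Monte Carlo estimate of expected revenue, stepping from jump to jump."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        t, stock, revenue = 0.0, capacity, 0.0
        while stock > 0:
            t += rng.expovariate(rate)  # time of the next sale (state jump)
            if t > horizon:
                break
            stock -= 1
            revenue += price  # lump-sum reward collected at the jump
        total += revenue
    return total / n_paths


def exact_value(rate, price, capacity, horizon):
    """price * E[min(N, capacity)] with N ~ Poisson(rate * horizon)."""
    mu = rate * horizon
    pmf, cdf, expected = math.exp(-mu), 0.0, 0.0
    for n in range(capacity):
        expected += n * pmf
        cdf += pmf
        pmf *= mu / (n + 1)  # Poisson pmf recursion
    return price * (expected + capacity * (1.0 - cdf))


estimate = mc_policy_value(rate=2.0, price=10.0, capacity=5, horizon=2.0, n_paths=20000)
reference = exact_value(rate=2.0, price=10.0, capacity=5, horizon=2.0)
print(round(estimate, 2), round(reference, 2))
```

The loop never references a time step size: the number of iterations per path equals the number of sales, which is the sense in which the sample path discretizes itself.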
What would settle it
Run the proposed continuous-time actor-critic method on a highly non-stationary, large-scale revenue management problem. If it fails to outperform a carefully tuned discretization-based RL baseline in revenue or computation time, the claim does not hold.
Original abstract
Intensity control is a class of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we propose a practical continuous-time reinforcement learning framework for intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state space, a large action space, and a continuous time horizon. We show that by leveraging the event-driven structure of the problem and the inherent discretization of sample paths created by the state-jump times, a defining feature of intensity control, one does not need to discretize the time horizon in advance. We adapt discrete-time Monte Carlo and temporal difference learning algorithms for policy evaluation to continuous time and develop policy-gradient-based actor-critic algorithms for event-driven intensity control. Through a comprehensive numerical study, we evaluate the proposed approach against various state-of-the-art benchmarks, demonstrating its overall superior performance and effective scalability to large-scale problems. Notably, compared to discretization-based reinforcement learning methods, our continuous-time approach delivers significantly superior performance while maintaining comparable computational efficiency. This advantage is particularly pronounced in highly non-stationary environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a continuous-time RL framework for intensity control problems, with choice-based network revenue management as the running example. It adapts standard discrete-time algorithms (Monte Carlo, temporal-difference, and policy-gradient actor-critic) to the continuous-time setting by exploiting the event-driven jump times that naturally discretize sample paths, thereby avoiding any a-priori time-grid discretization. A numerical study is reported that claims the resulting methods outperform a range of state-of-the-art benchmarks, with the advantage being especially pronounced in non-stationary regimes, while retaining comparable computational cost and scaling to large instances.
Significance. If the numerical evidence is robust, the work would supply a practical route for applying RL to a broad class of continuous-time intensity-control problems that arise in revenue management and queueing, without incurring the approximation error or computational overhead of fixed-horizon discretization. The explicit construction of event-driven Monte Carlo and TD estimators, together with the actor-critic extension, is a concrete methodological contribution that could be reused in other jump-process settings.
major comments (1)
- [Abstract and §5 (Numerical Study)] The central claim that the continuous-time approach 'delivers significantly superior performance' rests on a numerical study whose experimental design, effect sizes, confidence intervals, and statistical tests are not described in the abstract and are only alluded to at a high level in the text. Because this performance comparison is the primary empirical support for the method, the absence of these details is load-bearing for the paper's main conclusion.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address the single major comment below and will incorporate the suggested clarifications in the revision.
- Referee: [Abstract and §5 (Numerical Study)] The central claim that the continuous-time approach 'delivers significantly superior performance' rests on a numerical study whose experimental design, effect sizes, confidence intervals, and statistical tests are not described in the abstract and are only alluded to at a high level in the text. Because this performance comparison is the primary empirical support for the method, the absence of these details is load-bearing for the paper's main conclusion.
  Authors: We agree that the numerical evidence is the primary support for the central claim and that the current presentation of experimental details is insufficient. In the revised manuscript we will expand §5 with a dedicated subsection on the experimental protocol. This will include: (i) the number of independent replications and random seeds used for each instance, (ii) the precise performance metric (e.g., relative revenue improvement) together with effect-size reporting, (iii) 95% confidence intervals computed across replications, and (iv) any hypothesis tests employed to compare methods. We will also revise the abstract to add one sentence summarizing the scale and statistical support of the reported gains. These additions will make the empirical claims fully reproducible and transparent without altering the underlying results.
  Revision: yes
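For readers who want to see what item (iii) amounts to in practice, the following is a minimal sketch of a 95% confidence interval computed across replications. It assumes a normal approximation (a Student-t quantile would be more appropriate for few replications), and the input values are placeholder numbers, not results from the paper. The name `mean_ci95` is hypothetical.

```python
import math
import statistics


def mean_ci95(samples):
    """Return (mean, lower, upper) for a normal-approximation 95% interval."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # 1.96 is the 97.5% standard normal quantile (an approximation for small n).
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Placeholder per-replication revenue differences (method minus baseline).
diffs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lower, upper = mean_ci95(diffs)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Pairing the two methods on the same random seeds before taking differences, as sketched here, typically narrows the interval compared with treating the two sets of runs as independent.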
Circularity Check
No significant circularity; method adapts standard RL algorithms to event-driven structure with empirical validation
The paper's core contribution is the adaptation of existing discrete-time Monte Carlo, TD, and policy-gradient RL algorithms to continuous-time intensity control problems by exploiting the natural event-driven discretization from state jumps, without advance time discretization. This is a direct methodological mapping rather than a derivation that reduces to fitted inputs or self-referential definitions. Performance claims rest on numerical comparisons to external benchmarks, not on quantities defined in terms of the target metrics. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided material; the approach is self-contained against standard RL literature and external baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL hyperparameters (learning rates, discount factors, etc.)
axioms (1)
- domain assumption: Intensity control problems admit an event-driven continuous-time Markov decision process formulation.