
arxiv: 2406.05358 · v3 · submitted 2024-06-08 · 💻 cs.LG · math.OC

Reinforcement Learning for Intensity Control: An Application to Choice-Based Network Revenue Management

Pith reviewed 2026-05-24 00:33 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords reinforcement learning · intensity control · network revenue management · continuous-time · actor-critic · policy gradient · event-driven learning

The pith

A continuous-time reinforcement learning framework for intensity control adapts discrete-time algorithms using event-driven state jumps without pre-discretizing time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a practical reinforcement learning method for continuous-time intensity control problems like choice-based network revenue management. It shows that the natural discretization from state jumps allows adapting Monte Carlo, temporal difference, and policy gradient methods directly to continuous time. This is useful because discretization can be inefficient or inaccurate in large or changing environments, and the experiments indicate better performance and scalability than discretization-based alternatives.

Core claim

By leveraging the event-driven structure of intensity control and the discretization created by state-jump times, discrete-time RL algorithms can be adapted to continuous time without advance discretization of the time horizon, leading to policy evaluation and actor-critic methods that scale well and perform better than benchmarks, especially in non-stationary settings.

What carries the argument

The event-driven structure of the problem, which creates inherent discretization via state-jump times, allowing adaptation of discrete-time algorithms to continuous time.
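
To make the idea of "inherent discretization via state-jump times" concrete, here is a minimal Python sketch of a toy single-resource sales process. It is not the paper's model or algorithm. The function names (`simulate_jump_path`, `rewards_to_go`), the single-resource setup, and the assumption that the sales intensity depends only on the current inventory (so it stays constant between jumps) are all illustrative choices made here.

```python
import random


def simulate_jump_path(horizon, capacity, rate_fn, price, rng):
    """Simulate one path of a toy single-resource sales process.

    Assumption (illustrative): the sales intensity rate_fn(inventory) depends
    only on the current inventory, so it is constant between jumps and the
    holding time is exactly exponential.
    Returns a list of (jump_time, inventory_after_jump, reward) tuples.
    """
    t, inventory, events = 0.0, capacity, []
    while inventory > 0:
        rate = rate_fn(inventory)
        if rate <= 0.0:
            break  # no further sales can occur
        t += rng.expovariate(rate)  # time of the next state jump
        if t >= horizon:
            break  # the horizon ends before the next sale
        inventory -= 1
        events.append((t, inventory, price))
    return events


def rewards_to_go(events):
    """Monte Carlo reward-to-go at each jump time (includes that jump's reward)."""
    total, out = 0.0, []
    for t, _, reward in reversed(events):
        total += reward
        out.append((t, total))
    out.reverse()
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    path = simulate_jump_path(10.0, 5, lambda inv: 0.8, 100.0, rng)
    for t, g in rewards_to_go(path):
        print(f"jump at t={t:.3f}, reward-to-go={g:.1f}")
```

The only time points the sketch ever touches are the random jump times themselves; no fixed time grid is chosen in advance, which is the structural feature the argument relies on.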

If this is right

  • The continuous-time approach delivers significantly superior performance compared to discretization-based reinforcement learning methods while maintaining comparable computational efficiency.
  • This advantage is particularly pronounced in highly non-stationary environments.
  • The method effectively scales to large-scale problems.
  • Actor-critic algorithms based on policy gradients can be developed for event-driven intensity control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar event-driven adaptations could apply to other continuous-time stochastic control problems such as queueing systems.
  • The framework might simplify implementation for practitioners who already use discrete RL tools on continuous processes.
  • Testing on even larger instances or real data could further validate the scalability claims.

Load-bearing premise

The event-driven structure of intensity control problems permits direct adaptation of discrete-time Monte Carlo, temporal difference, and policy-gradient algorithms to continuous time without requiring advance discretization of the time horizon.
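
As a rough illustration of what a temporal difference update applied at jump times could look like, the sketch below runs semi-gradient TD(0) over a list of observed transitions. It is a toy under stated assumptions, not the paper's algorithm: rewards are assumed to accrue only at jumps, there is no discounting, and the value parameterization `theta[x] * (1 - t / horizon)` (which is zero at the horizon by construction) and all names are hypothetical.

```python
def value(theta, t, inventory, horizon):
    """Toy value estimate V(t, x) = theta[x] * (1 - t / horizon)."""
    return theta[inventory] * (1.0 - t / horizon)


def td0_at_jumps(theta, transitions, horizon, alpha):
    """Apply semi-gradient TD(0) updates, one per observed transition.

    Each transition is (t, x, reward, t_next, x_next), where t_next is either
    the next jump time or the horizon itself (for the final stretch of a path).
    The list theta is updated in place and also returned.
    """
    for t, x, reward, t_next, x_next in transitions:
        feature = 1.0 - t / horizon  # gradient of V(t, x) with respect to theta[x]
        delta = (
            reward
            + value(theta, t_next, x_next, horizon)
            - value(theta, t, x, horizon)
        )
        theta[x] += alpha * delta * feature
    return theta
```

The update is indexed by transitions between consecutive event times rather than by steps on a fixed grid, so the number of updates per path equals the number of jumps plus the final stretch to the horizon.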

What would settle it

Running the proposed continuous-time actor-critic method on a highly non-stationary large-scale revenue management problem and finding it does not outperform a carefully tuned discretization-based RL baseline in terms of revenue or computation time.

Figures

Figures reproduced from arXiv: 2406.05358 by Huiling Meng, Ningyuan Chen, Xuefeng Gao.

Figure 1: Average revenue of Algorithm 1 as the iterations progress for the example in Section 6.2.
Figure 2: Airline network. This example considers a medium-sized airline network consisting of 6 flight legs with a total of 9 products (including local and connecting itineraries). The booking horizon is set to T = 200 time units.
Figure 3: Average revenue of Algorithm 1 as the iterations progress for the example in Section 6.3.
Original abstract

Intensity control is a class of continuous-time dynamic optimization problems with many important applications in Operations Research including queueing and revenue management. In this study, we propose a practical continuous-time reinforcement learning framework for intensity control using choice-based network revenue management as a case study, which is a classical problem in revenue management that features a large state space, a large action space, and a continuous time horizon. We show that by leveraging the event-driven structure of the problem and the inherent discretization of sample paths created by the state-jump times, a defining feature of intensity control, one does not need to discretize the time horizon in advance. We adapt discrete-time Monte Carlo and temporal difference learning algorithms for policy evaluation to continuous time and develop policy-gradient-based actor-critic algorithms for event-driven intensity control. Through a comprehensive numerical study, we evaluate the proposed approach against various state-of-the-art benchmarks, demonstrating its overall superior performance and effective scalability to large-scale problems. Notably, compared to discretization-based reinforcement learning methods, our continuous-time approach delivers significantly superior performance while maintaining comparable computational efficiency. This advantage is particularly pronounced in highly non-stationary environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a continuous-time RL framework for intensity control problems, with choice-based network revenue management as the running example. It adapts standard discrete-time algorithms (Monte Carlo, temporal-difference, and policy-gradient actor-critic) to the continuous-time setting by exploiting the event-driven jump times that naturally discretize sample paths, thereby avoiding any a-priori time-grid discretization. A numerical study is reported that claims the resulting methods outperform a range of state-of-the-art benchmarks, with the advantage being especially pronounced in non-stationary regimes, while retaining comparable computational cost and scaling to large instances.

Significance. If the numerical evidence is robust, the work would supply a practical route for applying RL to a broad class of continuous-time intensity-control problems that arise in revenue management and queueing, without incurring the approximation error or computational overhead of fixed-horizon discretization. The explicit construction of event-driven Monte Carlo and TD estimators, together with the actor-critic extension, is a concrete methodological contribution that could be reused in other jump-process settings.

major comments (1)
  1. [Abstract and §5 (Numerical Study)] The central claim that the continuous-time approach 'delivers significantly superior performance' rests on a numerical study whose experimental design, effect sizes, confidence intervals, and statistical tests are not described in the abstract and are only alluded to at a high level in the text. Because this performance comparison is the primary empirical support for the method, the absence of these details is load-bearing for the paper's main conclusion.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive report. We address the single major comment below and will incorporate the suggested clarifications in the revision.

Point-by-point responses
  1. Referee: [Abstract and §5 (Numerical Study)] The central claim that the continuous-time approach 'delivers significantly superior performance' rests on a numerical study whose experimental design, effect sizes, confidence intervals, and statistical tests are not described in the abstract and are only alluded to at a high level in the text. Because this performance comparison is the primary empirical support for the method, the absence of these details is load-bearing for the paper's main conclusion.

    Authors: We agree that the numerical evidence is the primary support for the central claim and that the current presentation of experimental details is insufficient. In the revised manuscript we will expand §5 with a dedicated subsection on the experimental protocol. This will include: (i) the number of independent replications and random seeds used for each instance, (ii) the precise performance metric (e.g., relative revenue improvement) together with effect-size reporting, (iii) 95% confidence intervals computed across replications, and (iv) any hypothesis tests employed to compare methods. We will also revise the abstract to add one sentence summarizing the scale and statistical support of the reported gains. These additions will make the empirical claims fully reproducible and transparent without altering the underlying results. revision: yes
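
The reporting items listed in this response (relative revenue improvement, 95% confidence intervals across replications) could be computed along the following lines. This is a minimal sketch using only the Python standard library. The function names are hypothetical, the interval uses a normal approximation rather than a Student t quantile (a simplification that is loose for few replications), and nothing here reflects the authors' actual protocol.

```python
import math
import statistics


def mean_confidence_interval(samples, level=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two replications")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean, mean - z * sem, mean + z * sem


def paired_improvement(method_revenue, baseline_revenue, level=0.95):
    """CI for the mean relative revenue improvement over paired replications.

    Replication i of the method and of the baseline are assumed to share
    the same random seed or instance, so the two lists are paired by index.
    """
    if len(method_revenue) != len(baseline_revenue):
        raise ValueError("paired samples must have equal length")
    rel = [(m - b) / b for m, b in zip(method_revenue, baseline_revenue)]
    return mean_confidence_interval(rel, level)
```

Pairing replications by seed is one common design choice because it removes shared instance noise from the comparison; whether the paper's experiments are paired in this way is not stated in the material above.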

Circularity Check

0 steps flagged

No significant circularity; method adapts standard RL algorithms to event-driven structure with empirical validation

full rationale

The paper's core contribution is the adaptation of existing discrete-time Monte Carlo, TD, and policy-gradient RL algorithms to continuous-time intensity control problems by exploiting the natural event-driven discretization from state jumps, without advance time discretization. This is a direct methodological mapping rather than a derivation that reduces to fitted inputs or self-referential definitions. Performance claims rest on numerical comparisons to external benchmarks, not on quantities defined in terms of the target metrics. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided material; the approach is self-contained against standard RL literature and external baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard modeling assumptions for intensity control and continuous-time MDPs; no new entities are postulated and free parameters are the usual RL hyperparameters whose values are not detailed in the abstract.

free parameters (1)
  • RL hyperparameters (learning rates, discount factors, etc.)
    Standard in actor-critic and policy-gradient methods; values not specified in abstract.
axioms (1)
  • [domain assumption] Intensity control problems admit an event-driven continuous-time Markov decision process formulation.
    Invoked when stating that the event-driven structure can be leveraged directly.


