Joint Scheduling of Deferrable and Nondeferrable Demand with Colocated Stochastic Supply
Pith reviewed 2026-05-19 04:45 UTC · model grok-4.3
The pith
Under deterministic piecewise-linear retail prices, optimal deferrable demand scheduling reduces to three procrastination parameters per demand class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under deterministic, time-varying, and piecewise-linear retail pricing, the optimal demand scheduling policy follows the Principle of Procrastination, which reduces the infinite-dimensional policy space to a finite-dimensional Euclidean space defined by three procrastination parameters for each deferrable demand.
What carries the argument
The Principle of Procrastination: the structural property that the optimal policy defers service according to three simple threshold parameters per deferrable demand class, turning the continuous-state MDP into a finite-dimensional optimization problem.
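As a minimal sketch, the rule below assumes the threshold form that the appendix derives for the deferrable service rate, $v^*_t = \min\{\bar v, [y_t - \theta]^+\}$. It shows how a single scalar replaces a whole policy function: the controller keeps up to $\theta$ units of backlog deferred and serves only the excess, subject to the rate limit $\bar v$. The function name `procrastination_rate` is hypothetical, and the sketch does not show how the paper's three parameters per class are assigned to regimes.

```python
def procrastination_rate(backlog, theta, v_max):
    """Service rate under a procrastination threshold.

    Keep up to `theta` units of backlog deferred, serve the excess,
    and never exceed the rate limit `v_max`: min(v_max, [backlog - theta]^+).
    """
    return min(v_max, max(backlog - theta, 0.0))


# Example: rate limit 2, threshold 3. Small backlogs are postponed entirely.
for y in [1.0, 3.0, 4.5, 8.0]:
    print(y, procrastination_rate(y, theta=3.0, v_max=2.0))
```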
If this is right
- The policy search space shrinks from infinite-dimensional functions to a finite number of scalar parameters, one set of three per deferrable demand class.
- A Procrastination Threshold Reinforcement Learning algorithm can learn the parameters from samples when arrival and supply distributions are unknown.
- Numerical tests on real-world data show the learned thresholds closely approximate the optimal policy and outperform standard benchmark schedulers.
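The sample-based learning idea above can be illustrated with a deliberately naive stand-in. This is not the authors' Procrastination Threshold Reinforcement Learning algorithm: it scores a few candidate thresholds by Monte Carlo simulation and keeps the cheapest. Everything in it is an assumption made for illustration: a single deferrable job per episode, one threshold capped by the remaining capacity, uniformly distributed job sizes and supply, no nondeferrable demand, and the hypothetical helpers `bill`, `episode_cost`, and `learn_threshold`.

```python
import random


def bill(z, pi_plus, pi_minus):
    """Piecewise-linear bill: net import z >= 0 costs pi_plus, export z < 0 earns pi_minus."""
    return pi_plus * z if z >= 0 else pi_minus * z


def episode_cost(theta, job, supply, pi_plus, pi_minus, v_max):
    """Total bill for one deferrable job of size `job` served with threshold `theta`.

    supply[t] is local generation in slot t and pi_plus[t] the deterministic,
    time-varying import price. Capping the threshold at the capacity of the
    remaining slots keeps any job with job <= len(supply) * v_max feasible.
    """
    T, y, cost = len(supply), job, 0.0
    for t, g in enumerate(supply):
        th = min(theta, (T - t - 1) * v_max)
        v = min(v_max, max(y - th, 0.0))
        cost += bill(v - g, pi_plus[t], pi_minus)
        y -= v
    return cost


def learn_threshold(candidates, pi_plus, pi_minus, v_max, n_episodes=200, seed=0):
    """Return the candidate with the lowest sample-average cost, plus all scores."""
    rng = random.Random(seed)
    T = len(pi_plus)
    # Hypothetical sampling model: uniform job sizes and uniform supply in [0, 1].
    episodes = [
        (rng.uniform(0.0, T * v_max), [rng.uniform(0.0, 1.0) for _ in range(T)])
        for _ in range(n_episodes)
    ]
    # Common random numbers: every candidate is scored on the same episodes.
    scores = {
        theta: sum(episode_cost(theta, job, sup, pi_plus, pi_minus, v_max)
                   for job, sup in episodes) / n_episodes
        for theta in candidates
    }
    return min(scores, key=scores.get), scores


best, scores = learn_threshold([0.0, 1.0, 2.0], pi_plus=[2.0, 1.0, 0.5], pi_minus=0.1, v_max=1.0)
print(best, scores)
```

Scoring every candidate on the same sampled episodes reduces the variance of the comparison; enumeration is only practical here because the toy has a single scalar parameter.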
Where Pith is reading between the lines
- The three-parameter reduction may extend to deadline-constrained resource allocation problems outside electricity, such as compute job scheduling or vehicle routing, whenever costs are piecewise linear.
- Implementation in a real smart-grid controller would require only storing and updating three numbers per demand class, making online recomputation feasible at scale.
- Varying the number of linear segments in the price function would test how the required number of procrastination parameters grows.
Load-bearing premise
The retail electricity price is deterministic, known in advance, and piecewise linear.
What would settle it
Solve the MDP explicitly for a small instance with non-piecewise-linear prices and check whether the resulting policy can still be represented exactly by three procrastination parameters per demand class.
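A toy version of this check can be run by backward induction. The sketch is a simplification, not the paper's MDP: one deferrable job on an integer backlog grid, deterministic per-slot cost functions, no stochastic supply or nondeferrable demand, and a linear penalty on backlog left at the deadline. For each slot it tests whether the optimal policy row equals $\min\{\bar v, [y - \theta]^+\}$ for some integer $\theta$. The names `solve_dp` and `fitted_threshold` are hypothetical.

```python
def solve_dp(cost_fns, v_max, y_max, penalty):
    """Backward induction for one deferrable job on an integer backlog grid.

    cost_fns[t](v) is the deterministic cost of serving v units in slot t;
    backlog left at the deadline costs `penalty` per unit.
    Returns policy[t][y] = optimal number of units to serve.
    """
    T = len(cost_fns)
    value = [penalty * y for y in range(y_max + 1)]  # terminal cost-to-go
    policy = [[0] * (y_max + 1) for _ in range(T)]
    for t in reversed(range(T)):
        new_value = [0] * (y_max + 1)
        for y in range(y_max + 1):
            best_v, best_cost = 0, float("inf")
            for v in range(min(v_max, y) + 1):
                c = cost_fns[t](v) + value[y - v]
                # Strict improvement only: ties go to the smaller v (procrastinate).
                if c < best_cost:
                    best_v, best_cost = v, c
            new_value[y] = best_cost
            policy[t][y] = best_v
        value = new_value
    return policy


def fitted_threshold(row, v_max):
    """Return theta if row[y] == min(v_max, max(y - theta, 0)) for every y, else None."""
    for theta in range(len(row)):
        if all(v == min(v_max, max(y - theta, 0)) for y, v in enumerate(row)):
            return theta
    return None


linear = [lambda v: 3 * v, lambda v: 1 * v, lambda v: 2 * v]
print([fitted_threshold(row, 2) for row in solve_dp(linear, v_max=2, y_max=6, penalty=100)])  # -> [4, 0, 0]

quad = [lambda v: v * v, lambda v: v * v]
print([fitted_threshold(row, 4) for row in solve_dp(quad, v_max=4, y_max=4, penalty=100)])  # -> [None, 0]
```

With linear per-slot prices every row is a threshold rule. With the quadratic cost, the first row spreads service evenly ([0, 0, 1, 1, 2]) and no single threshold reproduces it, which is the kind of failure the proposed test would look for.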
Original abstract
We investigate the problem of serving deferrable and nondeferrable electric demands with colocated stochastic supply and grid-imported electricity. Deferrable demands arrive randomly and can be delayed within their service deadlines. Nondeferrable demands are always present and must be served immediately, but the quantity served depends on the cost of electricity. Colocated supply is stochastic with zero marginal cost. It can be used to meet demand or exported to the grid to maximize profit. The stochasticity of demands and local supply makes optimal scheduling a Markov decision process with continuous (uncountable) state and action spaces. Under deterministic, time-varying, and piecewise-linear retail pricing of electricity, we show that the optimal demand scheduling follows the Principle of Procrastination, which reduces the infinite-dimensional policy space to a finite-dimensional Euclidean space defined by three procrastination parameters for each deferrable demand. For settings in which the underlying probability distributions are unknown, we propose a Procrastination Threshold Reinforcement Learning algorithm. Numerical experiments based on real-world test data confirm that the proposed threshold learning algorithm closely approximates the optimal policy and outperforms standard benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies joint scheduling of deferrable and nondeferrable electric demands served by colocated stochastic supply and grid imports under deterministic time-varying piecewise-linear retail pricing. It derives that the optimal policy obeys the Principle of Procrastination, which collapses the infinite-dimensional policy space of the underlying continuous-state MDP to a finite-dimensional parameterization consisting of three procrastination parameters per deferrable demand class. For unknown distributions the authors introduce a Procrastination Threshold Reinforcement Learning algorithm and report numerical experiments on real-world data showing that the learned policy closely approximates the optimum and outperforms standard benchmarks.
Significance. If the structural result is correct, the reduction from an uncountable policy space to three scalar parameters per demand class is a meaningful contribution to stochastic optimal control for energy systems; it directly enables both exact dynamic programming on the reduced space and the design of the proposed RL method. The explicit use of the piecewise-linear price assumption to obtain the procrastination property, together with the real-data validation, strengthens the practical relevance for smart-grid scheduling.
major comments (1)
- [Optimal policy derivation] The derivation of the Principle of Procrastination (abstract and the section presenting the optimal policy) must explicitly establish that the three procrastination parameters remain invariant to the realized stochastic supply state. Because the supply is observed before the scheduling decision and has zero marginal cost, the effective marginal cost of serving deferrable load at any instant is the minimum of the retail price and the opportunity cost of forgoing export; if this dependence is not shown to preserve the threshold structure, the claimed reduction to a supply-independent three-parameter policy may not hold.
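A minimal sketch of the marginal-cost point raised above, assuming a net-metering style bill in which net imports are charged at $\pi^+$ and net exports are credited at $\pi^-$ (with $\pi^- \le \pi^+$). The helper names `net_cost` and `marginal_cost` are hypothetical. In this model the supply realization affects the marginal cost of one more unit of load only through which side of the kink the net position lies on.

```python
def net_cost(z, pi_plus, pi_minus):
    """Bill for net consumption z: imports (z >= 0) at pi_plus, exports (z < 0) credited at pi_minus."""
    return pi_plus * z if z >= 0 else pi_minus * z


def marginal_cost(load, supply, pi_plus, pi_minus):
    """Right derivative of net_cost(load - supply) with respect to load.

    While local supply still exceeds load, one more unit of load only forgoes
    an export credit (pi_minus); otherwise it must be imported (pi_plus).
    """
    return pi_plus if load >= supply else pi_minus


# Import price 3 and export credit 1 per unit.
print(marginal_cost(load=4, supply=6, pi_plus=3, pi_minus=1))  # -> 1 (net producing)
print(marginal_cost(load=6, supply=6, pi_plus=3, pi_minus=1))  # -> 3 (next unit is imported)
```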
minor comments (2)
- [Abstract] The abstract states that the policy is reduced to 'three procrastination parameters' but does not name or define them; adding a one-sentence definition or a forward reference to the equation that introduces them would improve clarity for readers.
- [Numerical experiments] Reporting the number of independent runs and standard-error bars on the performance metrics would allow readers to assess the statistical reliability of the claim that the learned policy 'closely approximates' the optimum.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. The single major comment raises an important point about explicitly establishing the invariance of the procrastination parameters to the observed supply state, and we address it directly below.
Point-by-point responses
- Referee: [Optimal policy derivation] The derivation of the Principle of Procrastination (abstract and the section presenting the optimal policy) must explicitly establish that the three procrastination parameters remain invariant to the realized stochastic supply state. Because the supply is observed before the scheduling decision and has zero marginal cost, the effective marginal cost of serving deferrable load at any instant is the minimum of the retail price and the opportunity cost of forgoing export; if this dependence is not shown to preserve the threshold structure, the claimed reduction to a supply-independent three-parameter policy may not hold.
Authors: We appreciate the referee highlighting the need for greater explicitness on this point. In the derivation of the Principle of Procrastination, the optimal policy for each deferrable demand class is characterized by three thresholds that dictate whether to serve the demand immediately or procrastinate, based on the remaining time to deadline and the time-varying piecewise-linear price segments. Although supply is observed and has zero marginal cost, the effective marginal cost for the procrastination decision equals the deterministic retail price when grid import is required or the (deterministic) forgone export revenue when local supply is used. Because both the retail price schedule and the export opportunity are deterministic and independent of the realized supply quantity, the comparison between current effective cost and expected future costs remains unchanged by the specific supply realization. The thresholds are therefore computed solely from the price function and deadline structure, rendering them invariant to the supply state. We acknowledge that the current manuscript states this invariance implicitly through the overall policy reduction but does not isolate it in a dedicated remark or lemma. In the revised version we will insert an explicit paragraph (or short lemma) immediately after the statement of the Principle of Procrastination that formally shows the supply-state independence of the three parameters per demand class, thereby confirming that the finite-dimensional parameterization is preserved.
Revision: yes
Circularity Check
Derivation is self-contained; no circularity in structural result or parameter learning.
Full rationale
The paper formulates the problem as an MDP with continuous state-action spaces and derives the Principle of Procrastination as a structural property under the explicit assumptions of deterministic, time-varying, piecewise-linear retail pricing. This reduces the policy to three procrastination parameters per deferrable demand class via mathematical analysis of the optimality conditions rather than by redefining inputs or fitting to the target outcome. The subsequent Procrastination Threshold RL algorithm learns those parameters from data when distributions are unknown, which is a standard estimation step and does not presuppose the result. No load-bearing step reduces by construction to a self-citation, fitted input renamed as prediction, or ansatz smuggled via prior work; the central claim remains independent of the fitted values and is falsifiable against the pricing assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- three procrastination parameters per deferrable demand class
axioms (2)
- domain assumption: Retail electricity price is deterministic, time-varying, and piecewise linear
- standard math: The joint process of random demand arrivals, stochastic supply, and nondeferrable demand is Markovian
Appendix A: Proof of Theorems

We use the following notation:
• $\bar V_t(y_t) := \mathbb{E}_g[V_t(y_t, g)]$
• $\bar V_t(y_t; g_t) := \mathbb{E}_{g'}[V_t(y_t, g') \mid g_t]$
• $\partial^+_y \bar V_t(y_t)$, $\partial^-_y \bar V_t(y_t)$: …
$g_t < v_t + \mathbf{1}^\top d_t$ (net consuming): By the first-order condition, the optimal schedule of nondeferrable load is $d^*_{it} = d^+_i := \min\{\bar d_i, (\partial U_{it})^{-1}(\pi^+_t)\}$ for all $i = 1, \ldots, K$. There are three possible cases for the left and right derivatives of $\bar V_{t+1}(y_t; g_t)$, namely $\partial^-_y \bar V_{t+1}(y_t; g_t)$ and $\partial^+_y \bar V_{t+1}(y_t; g_t)$.
Case 1: When $-\pi^+_t - \partial^+_y \bar V_{t+1}(0; g_t) \ge 0$, by the monotonicity of the right derivative, $-\pi^+_t - \partial^+_y \bar V_{t+1}(y_t - \min\{\bar v, y_t\}; g_t) \ge 0$ for all $y_t \le (T-t)\bar v$. Hence, by the first-order condition in $v$, $v^*_t = \min\{\bar v, y_t\}$, which is equivalent to $\theta^+_t(g_t) = 0$.
Case 2: When $-\pi^+_t - \partial^-_y \bar V_{t+1}((T-t)\bar v; g_t) \le 0$, by the monotonicity of the left derivative, $-\pi^+_t - \partial^-_y \bar V_{t+1}(y_t; g_t) \le 0$ for all $y_t \le (T-t)\bar v$. Hence, by the first-order condition in $v$, $v^*_t = 0$, which is equivalent to $\theta^+_t(g_t) = (T-t)\bar v$.
Case 3: If there exists $\chi^+ \in [0, (T-t)\bar v]$ that satisfies $-\pi^+_t - \partial^-_y \bar V_{t+1}(\chi^+; g_t) \le 0$ and $-\pi^+_t - \partial^+_y \bar V_{t+1}(\chi^+; g_t) \ge 0$, then by the first-order condition an optimal decision is $v^*_t = \min\{\bar v, [y_t - \chi^+]^+\}$. To summarize, for $g_t < \mathbf{1}^\top d^+_t + \min\{\bar v, [y_t - \theta^+_t(g_t)]^+\}$, we have $(d^*_t, v^*_t) = (d^+_t, \min\{\bar v, [y_t - \theta^+_t(g_t)]^+\})$.
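The three cases amount to locating where the marginal value of backlog crosses $-\pi^+_t$. As a sketch, assume $\bar V_{t+1}(\cdot; g_t)$ is available on a uniform grid of step `h` and is concave, so that its slopes are nonincreasing as the monotonicity arguments require; the threshold can then be read off the grid slopes. The name `threshold_from_value` is hypothetical, the grid's right end plays the role of $(T-t)\bar v$, and the same routine would apply with $\pi^-_t$ in the net-producing case.

```python
def threshold_from_value(values, h, price):
    """Threshold chi from a concave value function sampled at 0, h, 2h, ...

    At grid point k the left derivative is slopes[k-1] and the right
    derivative is slopes[k]; chi satisfies right <= -price <= left.
    """
    slopes = [(b - a) / h for a, b in zip(values, values[1:])]
    target = -price
    # Case 1: -price - right derivative at 0 >= 0, so serve as much as possible.
    if slopes[0] <= target:
        return 0.0
    # Case 3: interior crossing of the (nonincreasing) slopes.
    for k in range(1, len(slopes)):
        if slopes[k] <= target <= slopes[k - 1]:
            return k * h
    # Case 2: the left derivative at the right end is still above -price,
    # so postponing everything is optimal.
    return (len(values) - 1) * h


values = [0, -1, -3, -6, -10]  # slopes -1, -2, -3, -4
print(threshold_from_value(values, 1.0, price=2.5))  # -> 2.0
```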
$g_t > v_t + \mathbf{1}^\top d_t$ (net producing): By the first-order condition, the optimal nondeferrable load schedule is $d^*_{it} = d^-_i := \min\{\bar d_i, (\partial U_{it})^{-1}(\pi^-_t)\}$ for all $i = 1, \ldots, K$. As in the previous case, there are three possible cases for $\partial^-_y \bar V_{t+1}(y_t; g_t)$ and $\partial^+_y \bar V_{t+1}(y_t; g_t)$.
Case 1: When $-\pi^-_t - \partial^+_y \bar V_{t+1}(0; g_t) \ge 0$, by the monotonicity of the right derivative, $-\pi^-_t - \partial^+_y \bar V_{t+1}(y_t - \min\{\bar v, y_t\}; g_t) \ge 0$ for all $y_t \le (T-t)\bar v$. Hence, by the first-order condition in $v$, $v^*_t = \min\{\bar v, y_t\}$, which is equivalent to $\theta^-_t(g_t) = 0$.
Case 2: When $-\pi^-_t - \partial^-_y \bar V_{t+1}((T-t)\bar v; g_t) \le 0$, by the monotonicity of the left derivative, $-\pi^-_t - \partial^-_y \bar V_{t+1}(y_t; g_t) \le 0$ for all $y_t \le (T-t)\bar v$. Hence, by the first-order condition in $v$, $v^*_t = 0$, where $\theta^-_t(g_t) = (T-t)\bar v$.
Case 3: If there exists $\chi^- \in [0, (T-t)\bar v]$ that satisfies $-\pi^-_t - \partial^-_y \bar V_{t+1}(\chi^-; g_t) \le 0$ and $-\pi^-_t - \partial^+_y \bar V_{t+1}(\chi^-; g_t) \ge 0$, then by the first-order condition an optimal decision is $v^*_t = \min\{\bar v, [y_t - \chi^-(g_t)]^+\}$. To summarize, for $g_t > \mathbf{1}^\top d^-_t + \min\{\bar v, [y_t - \theta^-_t(g_t)]^+\}$, we have $(d^*_t, v^*_t) = (d^-_t, \min\{\bar v, [y_t - \theta^-_t(g_t)]^+\})$.
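Putting the net-consuming and net-producing summaries together gives a simple case dispatch. This is a sketch under stated assumptions: it takes the supply $g_t$, backlog $y_t$, the nondeferrable schedules $d^+$ and $d^-$, and the thresholds $\theta^+_t(g_t)$ and $\theta^-_t(g_t)$ as given, and returns `None` when neither inequality holds (the net-zero case, which the proof treats through a KKT analysis that is not sketched here). The names `rate` and `schedule` are hypothetical.

```python
def rate(y, theta, v_max):
    """Procrastination rate min(v_max, [y - theta]^+)."""
    return min(v_max, max(y - theta, 0.0))


def schedule(g, y, d_plus, d_minus, theta_plus, theta_minus, v_max):
    """Return (nondeferrable schedule, deferrable rate), or None in the net-zero case.

    d_plus / d_minus: nondeferrable schedules at the import / export price.
    theta_plus / theta_minus: thresholds for the net-consuming / net-producing regimes.
    """
    v_plus = rate(y, theta_plus, v_max)
    if g < sum(d_plus) + v_plus:       # net consuming
        return d_plus, v_plus
    v_minus = rate(y, theta_minus, v_max)
    if g > sum(d_minus) + v_minus:     # net producing
        return d_minus, v_minus
    return None                        # net zero: not covered by this sketch


print(schedule(g=0.0, y=3.0, d_plus=[1.0], d_minus=[2.0],
               theta_plus=1.0, theta_minus=0.0, v_max=2.0))  # -> ([1.0], 2.0)
```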
$g_t = v_t + \mathbf{1}^\top d_t$ (net zero): We solve (9) with the constraint $g_t = v_t + \mathbf{1}^\top d_t$. The problem becomes

$$\max_{(d, v) \in A,\; v + \mathbf{1}^\top d = g_t} \; U_t(d) + \bar V_{t+1}(y_t - v; g_t). \tag{22}$$

Since problem (22) satisfies Slater's condition, the KKT conditions are necessary and sufficient. The Lagrangian of (22) is $L_0 = U_t(d) + \nu(g_t - v - \mathbf{1}^\top d) + \bar V_{t+1}(y_t - \ldots$
$v \le g_T$: The objective function is $-\pi^-(v - g_T) - q(y_T - v)$. By the first-order condition with respect to $v$ and assumption A3, $-\pi^- + q'(y_T - v) > 0$. Hence, $v^*_T \ge g_T$.
$v > g_T$: The objective function is $-\pi^+(v - g_T) - q(y_T - v)$. By the first-order condition and assumption A3, $-\pi^+ + q'(y_T - v) > 0$, which implies $v^*_T = \min\{y_T, \bar v\}$. For $\theta_T = (T - T)\bar v = 0$, the procrastination charging rate (11) becomes

$$v^*_T = \begin{cases} y_T, & 0 < y_T \le \min\{y_T, g_T\}, \\ \min\{y_T, \bar v\}, & \min\{y_T, g_T\} \le y_T \le \bar v, \end{cases} \quad = \min\{y_T, \bar v\},$$

and Proposition 1 holds…
$y_t \le (T-t)\bar v + \min\{\bar v, g_t\}$: For a given sequence of realizations of DG, $(g_t, \ldots, g_T)$, consider a sequence of charging actions $(\tilde v_t, \ldots, \tilde v_T)$ with $\tilde v_t = \min\{y_t, g_t, \bar v\} - \delta > 0$ for $\delta > 0$. Consider another sequence of charging actions $(v^*_t, \ldots, v^*_T)$ such that $v^*_t = \tilde v_t + \delta$ and $\sum_{\tau=t}^{T} v^*_\tau = \sum_{\tau=t}^{T} \tilde v_\tau - \delta$. Let the cumulative reward under …
For $y_t > (T-t)\bar v + \min\{\bar v, g_t\}$: Suppose a sequence $(\tilde v_t, \ldots, \tilde v_T)$ with $\tilde v_t = y_t - (T-t)\bar v + \delta < \bar v$ for $\delta > 0$. Another sequence $(v^*_t, \ldots, v^*_T)$ with $v^*_t = \tilde v_t - \delta = y_t - (T-t)\bar v$ has a cumulative reward $R^*_t$ that satisfies

$$R^*_t = \sum_{\tau=t}^{T} -P_\pi(v^*_\tau - g_\tau) = -P_\pi(\tilde v_t - g_t) + \pi^+\delta + \sum_{\tau=t+1}^{T} -P_\pi(v^*_\tau - g_\tau) \ge -P_\pi(\tilde v_t - g_t) + \pi^+\delta + \sum_{\tau=t+1}^{T} -P_\pi(\tilde v \ldots$$