Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

Kai Hidajat

arxiv: 2606.29092 · v1 · pith:OR44IFZ4new · submitted 2026-06-27 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 09:21 UTC · model grok-4.3

classification: cs.LG, cs.AI

keywords: dynamic regret, adversarial MDPs, normal fan, occupancy polytope, face-crossing price, non-stationary environments, policy tracking

The pith

Dynamic regret in non-stationary adversarial MDPs decomposes exactly into priced face motion plus within-face selection error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard measures of non-stationarity such as total loss variation do not capture the true cost to a learner, because some large loss movements leave the optimal policy unchanged while small movements can force a complete policy switch. It models reward non-stationarity as motion through the normal fan of the occupancy-measure polytope, where each cone corresponds to a fixed optimal face and crossing a boundary changes the face. The central quantity is the face-crossing price, the minimum regret incurred by staying on the previous optimal face after the loss has moved. For any learner that tracks the prior face, dynamic regret factors exactly into the sum of these intrinsic prices along the path plus any remaining error from suboptimal choices inside the current face. This decomposition separates consequential non-stationarity from harmless variation, even when the raw loss path length is arbitrarily large.

Core claim

Occupancy measures form a polytope whose normal fan partitions loss space into cones, each exposing one optimal face. A path of losses traces a path through this fan; motion inside a cone leaves the optimal face fixed, while crossing a wall changes it. The face-crossing price is the minimum regret of remaining on the old face under the new loss. For any learner that tracks the previous face, dynamic regret equals the total priced motion through the fan plus the within-face selection error.
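
As a minimal sketch of these definitions, the Python below treats the polytope as a list of vertices and computes the exposed face and the face-crossing price by enumeration. The function names, the tolerance `tol`, and the three-action simplex example are illustrative assumptions, not the paper's notation or code. For an actual occupancy polytope the vertices would be the occupancy measures of deterministic policies.

```python
import numpy as np

def optimal_face(vertices, loss, tol=1e-9):
    """Return indices of vertices minimising <v, loss>; their convex hull is the exposed face."""
    values = vertices @ loss
    return np.flatnonzero(values <= values.min() + tol)

def face_crossing_price(vertices, prev_loss, new_loss, tol=1e-9):
    """Minimum regret under new_loss of staying on the face that was optimal for prev_loss."""
    old_face = optimal_face(vertices, prev_loss, tol)
    values = vertices @ new_loss
    # A linear loss attains its minimum over a face at one of the face's vertices.
    return float(values[old_face].min() - values.min())

# Hypothetical example: a one-state problem with three actions, so the polytope is the simplex.
simplex = np.eye(3)
prev = np.array([0.0, 1.0, 2.0])

inside_cone = face_crossing_price(simplex, prev, np.array([0.0, 50.0, 90.0]))
across_wall = face_crossing_price(simplex, prev, np.array([1.0, 0.5, 2.0]))
shared_vertex = face_crossing_price(
    simplex, np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])
)
print(inside_cone, across_wall, shared_vertex)  # 0.0 0.5 0.0
```

In the third case the new optimal face is a vertex of the old face, so staying on the old face costs nothing under this definition even though the face has changed.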

What carries the argument

The normal fan of the occupancy polytope together with the face-crossing price, which quantifies the regret cost of moving from one cone to another.

If this is right

Loss variation of arbitrary magnitude can produce zero priced motion and therefore zero added regret if the path stays inside one cone.
Identical variation in one coordinate can produce horizon-scale differences in regret depending on whether the change crosses a cone boundary.
Regret bounds can be expressed solely in terms of the priced length of the path through the fan rather than the total length of the loss path.
Learners achieve low dynamic regret by tracking only the sequence of optimal faces rather than reacting to all loss changes.
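
A small numeric sketch of the first two consequences, assuming a one-state problem (the probability simplex) and the price defined as the minimum regret of staying on the previous optimal face. The two loss paths and the helper names `priced_length` and `total_variation` are made up for illustration. The example does not reproduce the paper's horizon-scale construction, which needs a multi-step MDP.

```python
import numpy as np

def priced_length(losses, tol=1e-9):
    """Sum of face-crossing prices along a loss path on the probability simplex."""
    total = 0.0
    for prev, new in zip(losses[:-1], losses[1:]):
        old_face = np.flatnonzero(prev <= prev.min() + tol)  # previously optimal actions
        total += new[old_face].min() - new.min()
    return float(total)

def total_variation(losses):
    """Sum over rounds of the L1 distance between consecutive loss vectors."""
    return float(np.abs(np.diff(losses, axis=0)).sum())

# Path A: huge swings, but action 0 stays optimal, so the path never leaves one cone.
path_a = np.array([[0.0, 10.0, 10.0] if t % 2 == 0 else [0.0, 1000.0, 500.0]
                   for t in range(100)])
# Path B: tiny swings that flip the optimal action every round.
path_b = np.array([[0.0, 0.1, 1.0] if t % 2 == 0 else [0.1, 0.0, 1.0]
                   for t in range(100)])

for name, path in [("A", path_a), ("B", path_b)]:
    print(name, total_variation(path), priced_length(path))
```

Path A has a total variation of 146520 and a priced length of zero, while path B has a total variation of about 19.8 and a priced length of about 9.9.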

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithms could be designed to monitor estimated face crossings and switch only when the crossing price exceeds a threshold.
The same priced-motion decomposition may apply to other polyhedral decision problems where comparators lie on faces of a fixed feasible set.
In environments where face crossings are sparse, dynamic-regret bounds would scale with the number of crossings rather than total variation.
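
One way the thresholded-switching idea might look, as a hypothetical sketch and not an algorithm from the paper. It assumes a one-state problem with full-information feedback and uses the observed regret of the current action as a stand-in for an estimated crossing price. The names `thresholded_tracker`, `threshold`, and `start` are invented here.

```python
import numpy as np

def thresholded_tracker(losses, threshold, start=0):
    """Play the current action; switch to the latest loss's minimiser only if staying cost > threshold."""
    current = start
    choices, switches = [], 0
    for loss in losses:
        choices.append(current)                    # act before this round's loss is revealed
        staying_cost = loss[current] - loss.min()  # regret of the played action this round
        if staying_cost > threshold:
            current = int(np.argmin(loss))
            switches += 1
    return choices, switches

losses = np.array([
    [0.0, 0.05, 1.0],
    [0.05, 0.0, 1.0],
    [0.0, 0.05, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
])
patient = thresholded_tracker(losses, threshold=0.1)
eager = thresholded_tracker(losses, threshold=0.0)
print(patient)  # ([0, 0, 0, 0, 1], 1)
print(eager)    # ([0, 0, 1, 0, 1], 3)
```

With the threshold at 0.1 the learner ignores the cheap flips and switches once, when staying becomes expensive; with the threshold at 0 it chases every flip.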

Load-bearing premise

Transitions stay fixed, so the set of occupancy measures remains the same polytope and every loss vector exposes a well-defined optimal face.

What would settle it

A concrete finite-horizon MDP, loss sequence, and face-tracking learner for which the observed dynamic regret differs from the sum of face-crossing prices plus within-face error by more than a vanishing additive term.
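
A sketch of what such a check could look like in the simplest setting, assuming a one-state problem, a learner that plays an arbitrary vertex of the previous optimal face, and per-round definitions of price and selection error taken from the summary above. The harness name `decomposition_gap` and the random integer losses are assumptions. A test that could actually settle the claim would need a multi-step MDP and the paper's full definitions.

```python
import numpy as np

def decomposition_gap(losses, rng):
    """Largest |regret - (price + selection error)| over rounds for a face-tracking learner."""
    worst = 0.0
    for prev, new in zip(losses[:-1], losses[1:]):
        old_face = np.flatnonzero(prev == prev.min())  # vertices of the previous optimal face
        x = rng.choice(old_face)                       # learner plays some vertex of that face
        best_on_face = new[old_face].min()
        regret = new[x] - new.min()
        price = best_on_face - new.min()
        selection_error = new[x] - best_on_face
        worst = max(worst, abs(regret - (price + selection_error)))
    return worst

rng = np.random.default_rng(0)
# Integer-valued losses produce ties, so previous faces are often higher-dimensional.
losses = rng.integers(0, 4, size=(200, 5)).astype(float)
gap = decomposition_gap(losses, rng)
print(gap)
```

In this setting the gap is zero by arithmetic, because the price and the selection error are defined as the two halves of the per-round regret.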

Figures

Figures reproduced from arXiv: 2606.29092 by Kai Hidajat.

**Figure 2.** The Bellman price and its causal consequence. Left, under a fixed new loss the optimal …

**Figure 3.** Left, the causal-anisotropy identity in the layered tree, where the measured …

**Figure 4.** The advantage threshold of Section 6. Actions whose estimated advantage exceeds their confidence band lie outside the optimal face and lose mass, while actions whose confidence band crosses zero lie near a wall of the fan and retain mass pending further evidence. Panel text recovered from the figure (polytope Q, axes d1 and d2): (a) disjoint faces, F_{t−1} ∩ F_t = ∅, Γ^cross_t > 0 (positive price); (b) Intersecti…, F_{t−1} ∩ F_t ≠ ∅, Γ^cross_t = 0.

**Figure 5.** Whether a crossing is priced depends on the relative position of the old and new optimal …

**Figure 6.** Why two MDPs are needed. Inside a single fixed MDP, a one-coordinate loss change that …

**Figure 7.** Simplex normal fan. Left, mean loss variation and intrinsic face price by regime, with …

**Figure 8.** Layered tree. Predictive power of four candidate path-motion summaries for dynamic …

**Figure 9.** Degenerate faces. Within-face selection error against the dimension of the free face, over …

**Figure 10.** Best cumulative dynamic regret of four updates on four non-stationary gridworld scenarios, …

**Figure 11.** Gridworld over training, the thresholded update (teal) against mirror descent (gray), with …

Original abstract

In a changing decision problem, standard dynamic-regret analyses have often equated the cost of non-stationarity to how far loss moves. However, it is simultaneously possible for a loss sequence to travel far and retain the same optimal policy, or for a small movement in loss to force the optimal policy to change completely. Thus, the size of the movement through loss variation, transition variation, or comparator path length describe the adversary's motion, but not the cost of that motion to the control problem. For a more faithful analytic interpretation, this paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions. Occupancy measures form a polytope, and each loss vector exposes an optimal face of that polytope. Non-stationarity in rewards is therefore a path through the normal fan, where motion inside one cone leaves the optimal face unchanged, while crossing a wall may carry regret. We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. The resulting theory separates consequential from harmless non-stationarity, where loss variation can be arbitrarily large at zero price, and identical one-coordinate variation can hide horizon-scale differences in regret.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes dynamic regret in adversarial MDPs via normal-fan geometry on the occupancy polytope, defining a face-crossing price that splits regret into motion cost plus within-face error for face-tracking learners.

The main takeaway is a geometric decomposition: under fixed transitions, the occupancy measures form a static polytope whose faces are exposed by loss vectors. Non-stationarity becomes a path through the normal fan, where staying inside one cone leaves the optimal face unchanged at zero extra price, but crossing a wall incurs the minimum regret of sticking to the old face under the new loss. For any learner that tracks the prior face, dynamic regret then splits exactly into that priced crossing cost plus the usual within-face selection error.

This is new in the dynamic-regret literature for MDPs. Prior work measures adversary motion by total variation or comparator path length; here the cost is tied directly to whether the optimal face changes. The construction cleanly separates harmless large loss shifts from small shifts that force policy changes, which is a useful distinction.

The argument follows from the polytope geometry once the price is defined as the min regret of remaining on the old face. The fixed-transition assumption is explicit and keeps everything well-defined. No internal contradiction appears in the stated setup.

The soft spot is that the abstract states the decomposition without showing the derivation steps or a worked example, so it is hard to check edge cases such as multiple optimal faces or how the price behaves over a full horizon. Read from the abstract alone, the paper appears purely theoretical, with no reported checks against existing bounds, although the figure captions on this page mention simplex, layered-tree, and gridworld experiments.

This is for researchers already working on dynamic regret in MDPs who want a different language for non-stationarity. A reader comfortable with polyhedral methods will see the value quickly.

It deserves peer review. The framing is distinct enough that referees should examine whether the price definition delivers the claimed exact split in the full proofs.

Referee Report

0 major / 2 minor

Summary. The paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions, where occupancy measures form a polytope and loss vectors expose optimal faces. It defines face-crossing price as the minimum regret incurred by remaining on the previous optimal face under a new loss, and claims that for any learner tracking the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. This separates consequential non-stationarity (face crossings) from harmless loss variation (motion inside a cone).

Significance. If the decomposition holds, the work supplies a geometrically precise accounting of non-stationarity cost that can be zero despite arbitrarily large loss changes when the optimal face is unchanged. The exact, definitionally grounded decomposition and the static-polytope assumption under fixed transitions are explicit strengths that enable falsifiable distinctions between motion types.

minor comments (2)

[Abstract] Abstract, paragraph 3: the decomposition is stated without a forward reference to the theorem or proposition that establishes it; adding such a pointer would improve readability.
The definition of face-crossing price (minimum regret of staying on the old face) is introduced without an accompanying small numerical example illustrating a zero-price versus positive-price crossing; a one-paragraph illustration in §2 or §3 would clarify the normal-fan geometry.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, accurate summary of the normal-fan geometry and priced face-crossing decomposition, and for the positive significance assessment. The recommendation of minor revision is noted; no specific major comments were raised in the report.

Circularity Check

1 step flagged

Decomposition reduces to definition of face-crossing price

specific steps

self-definitional [Abstract]
"We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error."

The priced motion term is defined precisely as the regret incurred by staying on the old face; the claimed exact decomposition for face-tracking learners is therefore an immediate restatement of that definition rather than a derived equality that could be independently verified or falsified.

full rationale

The paper introduces a face-crossing price as the minimum regret of remaining on the prior optimal face, then states that dynamic regret decomposes exactly into priced face motion plus within-face error for learners that track the prior face. This equality holds by construction once the price is defined that way; no independent derivation or external benchmark is required for the equality to hold. The geometry of the normal fan and the fixed-transition polytope provides context but does not alter the definitional character of the split. No other load-bearing steps exhibit circularity.
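
Written out per round, the identity the audit describes is a one-line add-and-subtract. The notation is an assumption reconstructed from fragments on this page: $Q$ is the occupancy polytope, $F_{t-1}$ the previous optimal face, $x_t \in F_{t-1}$ the learner's choice, $\ell_t$ the new loss, $\Gamma^{\mathrm{cross}}_t$ the face-crossing price, and $\varepsilon^{\mathrm{sel}}_t$ the within-face selection error. The paper's own statement may differ in detail.

```latex
\begin{align*}
\underbrace{\langle x_t, \ell_t\rangle - \min_{q \in Q}\langle q, \ell_t\rangle}_{\text{per-round dynamic regret}}
&= \underbrace{\min_{u \in F_{t-1}}\langle u, \ell_t\rangle - \min_{q \in Q}\langle q, \ell_t\rangle}_{\Gamma^{\mathrm{cross}}_t \;\ge\; 0}
 \;+\; \underbrace{\langle x_t, \ell_t\rangle - \min_{u \in F_{t-1}}\langle u, \ell_t\rangle}_{\varepsilon^{\mathrm{sel}}_t \;\ge\; 0}
\end{align*}
```

The first term is nonnegative because $F_{t-1} \subseteq Q$, and the second because $x_t \in F_{t-1}$. Summing over rounds gives total priced motion plus total selection error.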

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger extracted from abstract only; no further parameters or entities are visible.

axioms (1)

domain assumption: Occupancy measures form a polytope for finite-horizon MDPs with fixed transitions
Stated as the geometric foundation in abstract paragraph 3.

invented entities (1)

face-crossing price (no independent evidence)
purpose: Minimum regret incurred by remaining on the previous optimal face under the new loss
New scalar introduced to quantify the cost of crossing a normal-fan wall.

pith-pipeline@v0.9.1-grok · 5770 in / 1212 out tokens · 44737 ms · 2026-06-30T09:21:52.926664+00:00 · methodology

A Proofs from the main text

A.1 Within-face selection

When the previous optimal face F_{t−1} is higher-dimensional, the learner chooses one occupancy inside it, and that choice is the source of the selection error ε^sel_t.

Proposition 2 (Predictive selection). Let m_t be a prediction of ℓ_t and choose x_t ∈ arg min_{u ∈ F_{t−1}} ⟨u, m_t⟩. If δ_t = ∥m_t − ℓ_t∥_*, then ε^sel_t ≤ ...

Table 1: Hyperparameters. The per-step loss unit in the priced quantities is γ. Step sizes η and the method-specific parameters are swept over the ranges shown, and every benchmark uses 32 seeds.

Benchmark: Settings
Simplex: K ∈ {16, 32, 64, 128} actions, horizon T = 5000, γ = 0.25, η ∈ {0.04, 0.08, 0.15, 0.3, 0.6, 1.0}, 24 trials per regime, four regimes
Layered tree: H ∈ {8, 12...

[1] [1]

Kakade, Jason D

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. (arXiv:1908.00261),

work page arXiv 1908

[2] [2]

Kakade, Jason D

doi: 10.48550/arXiv.1908.00261. URLhttp://arxiv.org/abs/1908.00261. Carlo Alfano, Rui Yuan, and Patrick Rebeschini. A novel framework for policy mirror descent with general parameterization and linear convergence. (arXiv:2301.13139),

work page doi:10.48550/arxiv.1908.00261 1908

[3] [4]

doi: 10.1201/9781315140223

Routledge. doi: 10.1201/9781315140223. URL https://www.taylorfrancis.com/ books/9781315140223. Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175,

work page doi:10.1201/9781315140223

[4] [5]

URL http://arxiv.org/abs/1307

doi: 10.1287/opre.2015.1408. URL http://arxiv.org/abs/1307

work page doi:10.1287/opre.2015.1408 2015

[5] [6]

URL https: //epubs.siam.org/doi/10.1137/0331063

doi: 10.1137/0331063. URL https: //epubs.siam.org/doi/10.1137/0331063. Adrian Rivera Cardoso, He Wang, and Huan Xu. Large scale markov decision processes with changing rewards,

work page doi:10.1137/0331063

[6] [7]

Large Scale Markov Decision Processes with Changing Rewards

URLhttps://arxiv.org/abs/1905.10649v1. Nicol`o Cesa-Bianchi and G ´abor Lugosi.Prediction, Learning, and Games

work page internal anchor Pith review Pith/arXiv arXiv 1905

[7] [8]

org/abs/2006.14389v1

URL https://arxiv. org/abs/2006.14389v1. Robert Dadashi, Adrien Ali Ta¨ıga, Nicolas Le Roux, Dale Schuurmans, and Marc G. Bellemare. The value function polytope in reinforcement learning. (arXiv:1901.11524),

work page arXiv 2006

[8] [10]

Nima Eshraghi and Ben Liang

doi: 10.14288/1.0044649. Nima Eshraghi and Ben Liang. Dynamic regret of online mirror descent for relatively smooth convex cost functions. (arXiv:2202.12843),

work page doi:10.14288/1.0044649

[9] [11]

URL http://arxiv.org/abs/2202.12843

doi: 10.48550/arXiv.2202.12843. URL http://arxiv.org/abs/2202.12843. Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online markov decision processes.Math- ematics of Operations Research, 34(3):726–736,

work page doi:10.48550/arxiv.2202.12843

[10] [12]

Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie

URL https://www.jstor.org/ stable/40538442. Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie. Dynamic regret of policy optimization in non-stationary environments. (arXiv:2007.00148),

work page arXiv 2007

[11] [13]

URLhttp://arxiv.org/abs/2007.00148

doi: 10.48550/arXiv.2007.00148. URLhttp://arxiv.org/abs/2007.00148. Eric C. Hall and Rebecca M. Willett. Dynamical models and tracking regret in online convex programming,

work page doi:10.48550/arxiv.2007.00148 2007

[12] [14]

Dynamical Models and Tracking Regret in Online Convex Programming

URLhttps://arxiv.org/abs/1301.1254v1. 10 Wasim Huleihel, Soumyabrata Pal, and Ofer Shayevitz. Learning user preferences in non-stationary environments. (arXiv:2101.12506),

work page internal anchor Pith review Pith/arXiv arXiv

[13] [15]

URL http:// arxiv.org/abs/2101.12506

doi: 10.48550/arXiv.2101.12506. URL http:// arxiv.org/abs/2101.12506. Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. (arXiv:1501.06225),

work page doi:10.48550/arxiv.2101.12506

[14] [16]

doi: 10.48550/arXiv.1501. 06225. URLhttp://arxiv.org/abs/1501.06225. Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial mdps with bandit feedback and unknown transition,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1501

[15] [17]

URL https://arxiv.org/abs/1912. 01192v5. S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning

1912

[16] [18]

URL https://proceedings.neurips.cc/paper_ files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract. html. Jan Felix Kleuker, Aske Plaat, and Thomas Moerland. On the effect of regularization in policy mirror descent. (arXiv:2507.08718),

work page arXiv 2001

[17] [19]

URL http: //arxiv.org/abs/2507.08718

doi: 10.48550/arXiv.2507.08718. URL http: //arxiv.org/abs/2507.08718. Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. (arXiv:2102.00135),

work page doi:10.48550/arxiv.2507.08718

[18] [20]

arXiv (2023) https://doi.org/10.48550/arXiv

doi: 10.48550/arXiv. 2102.00135. URLhttp://arxiv.org/abs/2102.00135. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Dynamic regret of adversarial linear mixture mdps. Advances in Neural Information Processing Systems 36, pp. 60685–60711,

work page internal anchor Pith review doi:10.48550/arxiv

[19] [21]

URL https://papers.nips.cc/paper_files/paper/2023/file/ becd02b89259774da2ede23116a80648-Paper-Conference.pdf

doi: 10.52202/ 075280-2650. URL https://papers.nips.cc/paper_files/paper/2023/file/ becd02b89259774da2ede23116a80648-Paper-Conference.pdf. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Near-optimal dynamic regret for adversarial linear mixture mdps, 2024a. URLhttps://arxiv.org/abs/2411.03107v1. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Improved algorithm for ...

work page arXiv 2023

[20] [22]

URL https://doi.org/10.1007/ s11228-008-0077-9

doi: 10.1007/s11228-008-0077-9. URL https://doi.org/10.1007/ s11228-008-0077-9. Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Ba s ¸ar. Model-free non-stationary rl: Near-optimal regret and applications in multi-agent rl and inventory control. (arXiv:2010.03161),

work page doi:10.1007/s11228-008-0077-9 2010

[21] [23]

URL http://arxiv.org/abs/ 2010.03161

doi: 10.48550/arXiv.2010.03161. URL http://arxiv.org/abs/ 2010.03161. Nikola Milosevic and Nico Scherf. The geometry of nonlinear reinforcement learning. (arXiv:2509.01432),

[22] [24]

URL http://arxiv.org/abs/2509.01432

doi: 10.48550/arXiv.2509.01432. URL http://arxiv.org/abs/2509.01432. Johannes Müller and Guido Montúfar. Geometry and convergence of natural policy gradient methods. Information Geometry, 7(S1):485–523,

[23] [25]

URL http://arxiv.org/abs/2211.02105

doi: 10.1007/s41884-023-00106-z. URL http://arxiv.org/abs/2211.02105. Arkadiĭ Semenovich Nemirovskiĭ and David Berkovich IUdin. Problem Complexity and Method Efficiency in Optimization. Wiley,

[24] [26]

Andreas Paffenholz

URL https://papers.nips.cc/paper_files/paper/2010/hash/7bb060764a818184ebb1cc0d43d382aa-Abstract.html. Andreas Paffenholz. Polyhedral geometry and linear optimization

2010

[25] [27]

Online Learning with Predictable Sequences

Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. (arXiv:1208.3728),

[26] [28]

Online Learning with Predictable Sequences

doi: 10.48550/arXiv.1208.3728. URL http://arxiv.org/abs/1208.3728. R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press,

[27] [29]

URL https://arxiv.org/abs/1905.07773v1. J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender systems in e-commerce. In Proceedings of the 1st ACM conference on Electronic commerce, EC ’99, pp. 158–166, New York, NY, USA,

[28] [30]

doi: 10.1145/336992.337035

Association for Computing Machinery. doi: 10.1145/336992.337035. URL https://dl.acm.org/doi/10.1145/336992.337035. Andreas Schlaginhaufen and Maryam Kamgarpour. Identifiability and generalizability in constrained inverse reinforcement learning. (arXiv:2306.00629),

[29] [31]

URL http://arxiv.org/abs/2306.00629

doi: 10.48550/arXiv.2306.00629. URL http://arxiv.org/abs/2306.00629. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. (arXiv:1502.05477),

[30] [32]

Trust Region Policy Optimization

doi: 10.48550/arXiv.1502.05477. URL http://arxiv.org/abs/1502.05477. Guy Shani, Ronen I. Brafman, and David Heckerman. An mdp-based recommender system. (arXiv:1301.0600),

[31] [33]

An MDP-based Recommender System

doi: 10.48550/arXiv.1301.0600. URL http://arxiv.org/abs/1301.0600. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press,

[32] [34]

Mirror descent policy optimization

Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. (arXiv:2005.09814),

[33] [35]

Mirror descent policy optimization

doi: 10.48550/arXiv.2005.09814. URL http://arxiv.org/abs/2005.09814. Joel A. Tropp. ACM 204: Lectures on convex geometry

[34] [36]

URL https://resolver.caltech.edu/CaltechAUTHORS:20220412-220319430

doi: 10.7907/GEDA-H205. URL https://resolver.caltech.edu/CaltechAUTHORS:20220412-220319430. Chen-Yu Wei and Haipeng Luo. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. (arXiv:2102.05406),

[35] [37]

URL http://arxiv.org/abs/2102.05406

doi: 10.48550/arXiv.2102.05406. URL http://arxiv.org/abs/2102.05406. Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. (arXiv:2105.11066),

[36] [38]

URL http://arxiv.org/abs/2105.11066

doi: 10.48550/arXiv.2105.11066. URL http://arxiv.org/abs/2105.11066. Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. (arXiv:2007.03479),

[37] [39]

URL http://arxiv

doi: 10.48550/arXiv.2007.03479. URL http://arxiv.org/abs/2007.03479. Peng Zhao, Long-Fei Li, and Zhi-Hua Zhou. Dynamic regret of online markov decision processes,

[38] [40]

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou

URL https://arxiv.org/abs/2208.12483v1. Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. (arXiv:2112.14368),

[39] [41]

URL http://arxiv.org/abs/2112.14368

48550/arXiv.2112.14368. URL http://arxiv.org/abs/2112.14368. Günter M. Ziegler. Lectures on Polytopes, volume 152 of Graduate Texts in Mathematics. Springer, New York, NY,

[40] [42]

doi: 10.1007/978-1-4613-8431-1. URL http://link.springer.com/10.1007/978-1-4613-8431-1. Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, volume

[41] [43]

Martin Zinkevich

URL https://papers.neurips.cc/paper_files/paper/2013/hash/68053af2923e00204c3ca7c6a3150cf7-Abstract.html. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent

2013

[42] [44]

Proposition 2 (Predictive selection). Let $m_t$ be a prediction of $\ell_t$ and choose $x_t \in \arg\min_{u \in F_{t-1}} \langle u, m_t \rangle$

A PROOFS FROM THE MAIN TEXT

A.1 WITHIN-FACE SELECTION

When the previous optimal face $F_{t-1}$ is higher-dimensional, the learner chooses one occupancy inside it, and that choice is the source of the selection error $\varepsilon^{\mathrm{sel}}_t$.

Proposition 2 (Predictive selection). Let $m_t$ be a prediction of $\ell_t$ and choose $x_t \in \arg\min_{u \in F_{t-1}} \langle u, m_t \rangle$. If $\delta_t = \|m_t - \ell_t\|_*$, then $\varepsilon^{\mathrm{sel}}_t \le$ ...
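The selection rule in Proposition 2 can be illustrated with a minimal sketch. It rests on assumptions that are not in the text: the face $F_{t-1}$ is given by an explicit vertex list, the selection error is taken to be the within-face gap $\langle x_t, \ell_t\rangle - \min_{u \in F_{t-1}} \langle u, \ell_t\rangle$, and the norm pair is $\ell_1$ (primal) and $\ell_\infty$ (dual). The function names and the numbers are hypothetical. The sketch does not reproduce the truncated bound above. It only checks the elementary Hölder consequence that the gap is at most $\delta_t \, \|x_t - u^\star\|_1$, where $u^\star$ is a within-face minimizer under the true loss.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def predictive_selection(face_vertices: Sequence[Vector], m_t: Vector) -> Vector:
    """Pick x_t minimizing <u, m_t> over the face.

    A linear function attains its minimum over a polytope at a vertex,
    so scanning the (assumed explicit) vertex list is enough.
    """
    return min(face_vertices, key=lambda u: dot(u, m_t))


def within_face_gap(face_vertices: Sequence[Vector], x_t: Vector, loss_t: Vector) -> float:
    """Assumed stand-in for the selection error: loss of x_t minus the best
    loss attainable inside the same face under the true loss."""
    return dot(x_t, loss_t) - min(dot(u, loss_t) for u in face_vertices)


# Hypothetical example: the face is one edge of the 3-action simplex.
face = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
loss_t = (0.30, 0.20, 0.90)   # true loss
m_t = (0.20, 0.25, 0.90)      # prediction that ranks the two vertices wrongly

x_t = predictive_selection(face, m_t)
gap = within_face_gap(face, x_t, loss_t)

# Prediction error in the dual (l_inf) norm, and l_1 distance to the true minimizer.
delta_t = max(abs(a - b) for a, b in zip(m_t, loss_t))
u_star = min(face, key=lambda u: dot(u, loss_t))
l1_dist = sum(abs(a - b) for a, b in zip(x_t, u_star))

print(x_t, gap, delta_t * l1_dist)
```

When the prediction is exact ($m_t = \ell_t$), the gap in this sketch is zero, so any error comes entirely from the mismatch $\delta_t$.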

1983

[43] [45]

Table 1: Hyperparameters

The per-step loss unit in the priced quantities is $\gamma$.

Table 1: Hyperparameters. Step sizes $\eta$ and the method-specific parameters are swept over the ranges shown, and every benchmark uses 32 seeds.

Benchmark | Settings
Simplex | $K \in \{16, 32, 64, 128\}$ actions, horizon $T = 5000$, $\gamma = 0.25$, $\eta \in \{0.04, 0.08, 0.15, 0.3, 0.6, 1.0\}$, 24 trials per regime, four regimes
Layered tree | $H \in \{8, 12$...

2006