pith. sign in

arxiv: 2606.29092 · v1 · pith:OR44IFZ4new · submitted 2026-06-27 · 💻 cs.LG · cs.AI

Priced Motion Through Optimal Faces: A Normal-Fan Geometry for Non-Stationary Adversarial MDPs

Pith reviewed 2026-06-30 09:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords dynamic regretadversarial MDPsnormal fanoccupancy polytopeface-crossing pricenon-stationary environmentspolicy tracking
0
0 comments X

The pith

Dynamic regret in non-stationary adversarial MDPs decomposes exactly into priced face motion plus within-face selection error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard measures of non-stationarity such as total loss variation do not capture the true cost to a learner, because some large loss movements leave the optimal policy unchanged while small movements can force a complete policy switch. It models reward non-stationarity as motion through the normal fan of the occupancy-measure polytope, where each cone corresponds to a fixed optimal face and crossing a boundary changes the face. The central quantity is the face-crossing price, the minimum regret incurred by staying on the previous optimal face after the loss has moved. For any learner that tracks the prior face, dynamic regret factors exactly into the sum of these intrinsic prices along the path plus any remaining error from suboptimal choices inside the current face. This decomposition separates consequential non-stationarity from harmless variation, even when the raw loss path length is arbitrarily large.

Core claim

Occupancy measures form a polytope whose normal fan partitions loss space into cones, each exposing one optimal face. A path of losses traces a path through this fan; motion inside a cone leaves the optimal face fixed, while crossing a wall changes it. The face-crossing price is the minimum regret of remaining on the old face under the new loss. For any learner that tracks the previous face, dynamic regret equals the total priced motion through the fan plus the within-face selection error.

What carries the argument

The normal fan of the occupancy polytope together with the face-crossing price, which quantifies the regret cost of moving from one cone to another.

If this is right

  • Loss variation of arbitrary magnitude can produce zero priced motion and therefore zero added regret if the path stays inside one cone.
  • Identical variation in one coordinate can produce horizon-scale differences in regret depending on whether the change crosses a cone boundary.
  • Regret bounds can be expressed solely in terms of the priced length of the path through the fan rather than the total length of the loss path.
  • Learners achieve low dynamic regret by tracking only the sequence of optimal faces rather than reacting to all loss changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithms could be designed to monitor estimated face crossings and switch only when the crossing price exceeds a threshold.
  • The same priced-motion decomposition may apply to other polyhedral decision problems where comparators lie on faces of a fixed feasible set.
  • In environments where face crossings are sparse, dynamic-regret bounds would scale with the number of crossings rather than total variation.

Load-bearing premise

Transitions stay fixed, so the set of occupancy measures remains the same polytope and every loss vector exposes a well-defined optimal face.

What would settle it

A concrete finite-horizon MDP, loss sequence, and face-tracking learner for which the observed dynamic regret differs from the sum of face-crossing prices plus within-face error by more than a vanishing additive term.

Figures

Figures reproduced from arXiv: 2606.29092 by Kai Hidajat.

Figure 1
Figure 1. Figure 1: Occupancy faces and normal cones. Left, the occupancy polytope, whose faces are the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The Bellman price and its causal consequence. Left, under a fixed new loss the optimal [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left, the causal-anisotropy identity in the layered tree, where the measured [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The advantage threshold of Section 6. Actions whose estimated advantage exceeds their confidence band lie outside the optimal face and lose mass, while actions whose confidence band crosses zero lie near a wall of the fan and retain mass pending further evidence. Q d1 d2 Ft ¡ 1 \ Ft = ; ¡ cross t > 0 (a) Disjoint faces positive price nt ¡ 1 nt µt Ft ¡ 1 Ft Q d1 d2 Ft ¡ 1 \ Ft ; ¡ cross t = 0 (b) Intersecti… view at source ↗
Figure 5
Figure 5. Figure 5: Whether a crossing is priced depends on the relative position of the old and new optimal [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Why two MDPs are needed. Inside a single fixed MDP, a one-coordinate loss change that [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Simplex normal fan. Left, mean loss variation and intrinsic face price by regime, with [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Layered tree. Predictive power of four candidate path-motion summaries for dynamic [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Degenerate faces. Within-face selection error against the dimension of the free face, over [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Best cumulative dynamic regret of four updates on four non-stationary gridworld scenarios, [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Gridworld over training, the thresholded update (teal) against mirror descent (gray), with [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

In a changing decision problem, standard dynamic-regret analyses have often equated the cost of non-stationarity to how far loss moves. However, it is simultaneously possible for a loss sequence to travel far and retain the same optimal policy, or for a small movement in loss to force the optimal policy to change completely. Thus, the size of the movement through loss variation, transition variation, or comparator path length describe the adversary's motion, but not the cost of that motion to the control problem. For a more faithful analytic interpretation, this paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions. Occupancy measures form a polytope, and each loss vector exposes an optimal face of that polytope. Non-stationarity in rewards is therefore a path through the normal fan, where motion inside one cone leaves the optimal face unchanged, while crossing a wall may carry regret. We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. The resulting theory separates consequential from harmless non-stationarity, where loss variation can be arbitrarily large at zero price, and identical one-coordinate variation can hide horizon-scale differences in regret.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper develops a normal-fan geometry for finite-horizon adversarial MDPs with fixed transitions, where occupancy measures form a polytope and loss vectors expose optimal faces. It defines face-crossing price as the minimum regret incurred by remaining on the previous optimal face under a new loss, and claims that for any learner tracking the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error. This separates consequential non-stationarity (face crossings) from harmless loss variation (motion inside a cone).

Significance. If the decomposition holds, the work supplies a geometrically precise accounting of non-stationarity cost that can be zero despite arbitrarily large loss changes when the optimal face is unchanged. The exact, definitionally grounded decomposition and the static-polytope assumption under fixed transitions are explicit strengths that enable falsifiable distinctions between motion types.

minor comments (2)
  1. [Abstract] Abstract, paragraph 3: the decomposition is stated without a forward reference to the theorem or proposition that establishes it; adding such a pointer would improve readability.
  2. The definition of face-crossing price (minimum regret of staying on the old face) is introduced without an accompanying small numerical example illustrating a zero-price versus positive-price crossing; a one-paragraph illustration in §2 or §3 would clarify the normal-fan geometry.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, accurate summary of the normal-fan geometry and priced face-crossing decomposition, and for the positive significance assessment. The recommendation of minor revision is noted; no specific major comments were raised in the report.

Circularity Check

1 steps flagged

Decomposition reduces to definition of face-crossing price

specific steps
  1. self definitional [Abstract]
    "We pose the notion of a face-crossing price, which is the minimum regret incurred by remaining on the previous optimal face under the new loss. For any learner that tracks the previous face, dynamic regret decomposes exactly into intrinsic priced face motion plus within-face selection error."

    The priced motion term is defined precisely as the regret incurred by staying on the old face; the claimed exact decomposition for face-tracking learners is therefore an immediate restatement of that definition rather than a derived equality that could be independently verified or falsified.

full rationale

The paper introduces a face-crossing price as the minimum regret of remaining on the prior optimal face, then states that dynamic regret decomposes exactly into priced face motion plus within-face error for learners that track the prior face. This equality holds by construction once the price is defined that way; no independent derivation or external benchmark is required for the equality to hold. The geometry of the normal fan and fixed-transition polytope provide context but do not alter the definitional character of the split. No other load-bearing steps exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger extracted from abstract only; no further parameters or entities are visible.

axioms (1)
  • domain assumption Occupancy measures form a polytope for finite-horizon MDPs with fixed transitions
    Stated as the geometric foundation in abstract paragraph 3.
invented entities (1)
  • face-crossing price no independent evidence
    purpose: Minimum regret incurred by remaining on the previous optimal face under the new loss
    New scalar introduced to quantify the cost of crossing a normal-fan wall.

pith-pipeline@v0.9.1-grok · 5770 in / 1212 out tokens · 44737 ms · 2026-06-30T09:21:52.926664+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 38 canonical work pages · 9 internal anchors

  1. [1]

    Kakade, Jason D

    Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. (arXiv:1908.00261),

  2. [2]

    Kakade, Jason D

    doi: 10.48550/arXiv.1908.00261. URLhttp://arxiv.org/abs/1908.00261. Carlo Alfano, Rui Yuan, and Patrick Rebeschini. A novel framework for policy mirror descent with general parameterization and linear convergence. (arXiv:2301.13139),

  3. [4]

    doi: 10.1201/9781315140223

    Routledge. doi: 10.1201/9781315140223. URL https://www.taylorfrancis.com/ books/9781315140223. Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175,

  4. [5]

    URL http://arxiv.org/abs/1307

    doi: 10.1287/opre.2015.1408. URL http://arxiv.org/abs/1307

  5. [6]

    URL https: //epubs.siam.org/doi/10.1137/0331063

    doi: 10.1137/0331063. URL https: //epubs.siam.org/doi/10.1137/0331063. Adrian Rivera Cardoso, He Wang, and Huan Xu. Large scale markov decision processes with changing rewards,

  6. [7]

    Large Scale Markov Decision Processes with Changing Rewards

    URLhttps://arxiv.org/abs/1905.10649v1. Nicol`o Cesa-Bianchi and G ´abor Lugosi.Prediction, Learning, and Games

  7. [8]

    org/abs/2006.14389v1

    URL https://arxiv. org/abs/2006.14389v1. Robert Dadashi, Adrien Ali Ta¨ıga, Nicolas Le Roux, Dale Schuurmans, and Marc G. Bellemare. The value function polytope in reinforcement learning. (arXiv:1901.11524),

  8. [10]

    Nima Eshraghi and Ben Liang

    doi: 10.14288/1.0044649. Nima Eshraghi and Ben Liang. Dynamic regret of online mirror descent for relatively smooth convex cost functions. (arXiv:2202.12843),

  9. [11]

    URL http://arxiv.org/abs/2202.12843

    doi: 10.48550/arXiv.2202.12843. URL http://arxiv.org/abs/2202.12843. Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online markov decision processes.Math- ematics of Operations Research, 34(3):726–736,

  10. [12]

    Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie

    URL https://www.jstor.org/ stable/40538442. Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie. Dynamic regret of policy optimization in non-stationary environments. (arXiv:2007.00148),

  11. [13]

    URLhttp://arxiv.org/abs/2007.00148

    doi: 10.48550/arXiv.2007.00148. URLhttp://arxiv.org/abs/2007.00148. Eric C. Hall and Rebecca M. Willett. Dynamical models and tracking regret in online convex programming,

  12. [14]

    Dynamical Models and Tracking Regret in Online Convex Programming

    URLhttps://arxiv.org/abs/1301.1254v1. 10 Wasim Huleihel, Soumyabrata Pal, and Ofer Shayevitz. Learning user preferences in non-stationary environments. (arXiv:2101.12506),

  13. [15]

    URL http:// arxiv.org/abs/2101.12506

    doi: 10.48550/arXiv.2101.12506. URL http:// arxiv.org/abs/2101.12506. Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. (arXiv:1501.06225),

  14. [16]

    doi: 10.48550/arXiv.1501. 06225. URLhttp://arxiv.org/abs/1501.06225. Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial mdps with bandit feedback and unknown transition,

  15. [17]

    URL https://arxiv.org/abs/1912. 01192v5. S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning

  16. [18]

    URL https://proceedings.neurips.cc/paper_ files/paper/2001/hash/4b86abe48d358ecf194c56c69108433e-Abstract. html. Jan Felix Kleuker, Aske Plaat, and Thomas Moerland. On the effect of regularization in policy mirror descent. (arXiv:2507.08718),

  17. [19]

    URL http: //arxiv.org/abs/2507.08718

    doi: 10.48550/arXiv.2507.08718. URL http: //arxiv.org/abs/2507.08718. Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. (arXiv:2102.00135),

  18. [20]

    arXiv (2023) https://doi.org/10.48550/arXiv

    doi: 10.48550/arXiv. 2102.00135. URLhttp://arxiv.org/abs/2102.00135. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Dynamic regret of adversarial linear mixture mdps. Advances in Neural Information Processing Systems 36, pp. 60685–60711,

  19. [21]

    URL https://papers.nips.cc/paper_files/paper/2023/file/ becd02b89259774da2ede23116a80648-Paper-Conference.pdf

    doi: 10.52202/ 075280-2650. URL https://papers.nips.cc/paper_files/paper/2023/file/ becd02b89259774da2ede23116a80648-Paper-Conference.pdf. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Near-optimal dynamic regret for adversarial linear mixture mdps, 2024a. URLhttps://arxiv.org/abs/2411.03107v1. Long-Fei Li, Peng Zhao, and Zhi-Hua Zhou. Improved algorithm for ...

  20. [22]

    URL https://doi.org/10.1007/ s11228-008-0077-9

    doi: 10.1007/s11228-008-0077-9. URL https://doi.org/10.1007/ s11228-008-0077-9. Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Ba s ¸ar. Model-free non-stationary rl: Near-optimal regret and applications in multi-agent rl and inventory control. (arXiv:2010.03161),

  21. [23]

    URL http://arxiv.org/abs/ 2010.03161

    doi: 10.48550/arXiv.2010.03161. URL http://arxiv.org/abs/ 2010.03161. Nikola Milosevic and Nico Scherf. The geometry of nonlinear reinforcement learning. (arXiv:2509.01432),

  22. [24]

    URL http://arxiv.org/ abs/2509.01432

    doi: 10.48550/arXiv.2509.01432. URL http://arxiv.org/ abs/2509.01432. Johannes M¨uller and Guido Mont´ufar. Geometry and convergence of natural policy gradient methods. Information Geometry, 7(S1):485–523,

  23. [25]

    URL http: //arxiv.org/abs/2211.02105

    doi: 10.1007/s41884-023-00106-z. URL http: //arxiv.org/abs/2211.02105. Arkadi˘ı Semenovich Nemirovski˘ı and David Berkovich IUdin.Problem Complexity and Method Efficiency in Optimization. Wiley,

  24. [26]

    Andreas Paffenholz

    URL https://papers.nips.cc/paper_files/paper/2010/ hash/7bb060764a818184ebb1cc0d43d382aa-Abstract.html. Andreas Paffenholz. Polyhedral geometry and linear optimization

  25. [27]

    Online Learning with Predictable Sequences

    Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. (arXiv:1208.3728),

  26. [28]

    Online Learning with Predictable Sequences

    doi: 10.48550/arXiv.1208.3728. URL http://arxiv.org/abs/ 1208.3728. R. Tyrrell Rockafellar.Convex Analysis. Princeton University Press,

  27. [29]

    URLhttps://arxiv.org/abs/1905.07773v1. J. Ben Schafer, Joseph Konstan, and John Riedl. Recommender systems in e-commerce. In Proceedings of the 1st ACM conference on Electronic commerce, EC ’99, pp. 158–166, New York, NY , USA,

  28. [30]

    doi: 10.1145/336992.337035

    Association for Computing Machinery. doi: 10.1145/336992.337035. URL https://dl.acm.org/doi/10.1145/336992.337035. Andreas Schlaginhaufen and Maryam Kamgarpour. Identifiability and generalizability in constrained inverse reinforcement learning. (arXiv:2306.00629),

  29. [31]

    URL http://arxiv.org/abs/2306.00629

    doi: 10.48550/arXiv.2306.00629. URL http://arxiv.org/abs/2306.00629. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. (arXiv:1502.05477),

  30. [32]

    Trust Region Policy Optimization

    doi: 10.48550/arXiv.1502.05477. URL http: //arxiv.org/abs/1502.05477. Guy Shani, Ronen I. Brafman, and David Heckerman. An mdp-based recommender system. (arXiv:1301.0600),

  31. [33]

    An MDP-based Recommender System

    doi: 10.48550/arXiv.1301.0600. URL http://arxiv.org/abs/ 1301.0600. Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press,

  32. [34]

    Mirror descent policy optimization

    Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. (arXiv:2005.09814),

  33. [35]

    Mirror descent policy optimization

    doi: 10.48550/arXiv.2005.09814. URL http:// arxiv.org/abs/2005.09814. Joel A. Tropp. Acm 204: Lectures on convex geometry

  34. [36]

    URL https://resolver.caltech.edu/CaltechAUTHORS:20220412-220319430

    doi: 10.7907/GEDA-H205. URL https://resolver.caltech.edu/CaltechAUTHORS:20220412-220319430. Chen-Yu Wei and Haipeng Luo. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. (arXiv:2102.05406),

  35. [37]

    URL http://arxiv.org/abs/2102.05406

    doi: 10.48550/arXiv.2102.05406. URL http://arxiv.org/abs/2102.05406. Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. (arXiv:2105.11066),

  36. [38]

    URL http://arxiv.org/abs/ 2105.11066

    doi: 10.48550/arXiv.2105.11066. URL http://arxiv.org/abs/ 2105.11066. Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. (arXiv:2007.03479),

  37. [39]

    URL http://arxiv

    doi: 10.48550/arXiv.2007.03479. URL http://arxiv. org/abs/2007.03479. Peng Zhao, Long-Fei Li, and Zhi-Hua Zhou. Dynamic regret of online markov decision processes,

  38. [40]

    Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou

    URLhttps://arxiv.org/abs/2208.12483v1. Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem- dependent dynamic regret for online convex optimization. (arXiv:2112.14368),

  39. [41]

    URLhttp://arxiv.org/abs/2112.14368

    48550/arXiv.2112.14368. URLhttp://arxiv.org/abs/2112.14368. 12 G¨unter M. Ziegler.Lectures on Polytopes, volume 152 ofGraduate Texts in Mathematics. Springer, New York, NY ,

  40. [42]

    , TITLE =

    doi: 10.1007/978-1-4613-8431-1. URL http://link.springer. com/10.1007/978-1-4613-8431-1. Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. InAdvances in Neural Information Processing Systems, volume

  41. [43]

    Martin Zinkevich

    URL https://papers.neurips.cc/paper_files/ paper/2013/hash/68053af2923e00204c3ca7c6a3150cf7-Abstract.html. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent

  42. [44]

    Proposition 2(Predictive selection).Let mt be a prediction of ℓt and choose xt ∈ arg minu∈Ft−1 ⟨u, mt⟩

    A PROOFS FROM THE MAIN TEXT A.1 WITHIN-FACE SELECTION When the previous optimal face Ft−1 is higher-dimensional, the learner chooses one occupancy inside it, and that choice is the source of the selection errorε sel t . Proposition 2(Predictive selection).Let mt be a prediction of ℓt and choose xt ∈ arg minu∈Ft−1 ⟨u, mt⟩. Ifδ t =∥m t −ℓ t∥∗, thenε sel t ≤...

  43. [45]

    Table 1: Hyperparameters

    The per-step loss unit in the priced quantities isγ. Table 1: Hyperparameters. Step sizes η and the method-specific parameters are swept over the ranges shown, and every benchmark uses32seeds. Benchmark Settings SimplexK∈ {16,32,64,128} actions, horizon T=5000 , γ=0.25, η∈ {0.04,0.08,0.15,0.3,0.6,1.0},24trials per regime, four regimes Layered treeH∈ {8,12...