pith. machine review for the scientific record.

arxiv: 2602.04737 · v2 · submitted 2026-02-04 · 💻 cs.LG

Recognition: 2 Lean theorem links

Rationality Measurement and Theory for Reinforcement Learning Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords: rationality measures · reinforcement learning · rational risk gap · Wasserstein distance · Rademacher complexity · environment shift · generalization · regularization

The pith

The rational risk gap for RL agents is upper bounded by the 1-Wasserstein distance between training and deployment environments plus the Rademacher complexity of the value function class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops measures of rationality for reinforcement learning agents by comparing their actions to those that would maximize a hidden true value function in the steepest direction. It defines expected rational risk as the accumulated value discrepancy along a deployment trajectory and introduces the rational risk gap as the difference between deployment and training versions of this quantity. The gap decomposes into an extrinsic component driven by changes in transition kernels and initial states between training and deployment, plus an intrinsic component reflecting the algorithm's generalisability. The extrinsic part is upper bounded by the 1-Wasserstein distance between the respective distributions, while the intrinsic part is bounded by the empirical Rademacher complexity of the value function class. These bounds explain why regularisers and domain randomisation preserve rationality and why environment shifts harm it, with experiments confirming the predictions.
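Schematically, with all notation assumed by this review rather than taken from the paper (L a Lipschitz constant of the hidden value function, c an absolute constant), the headline bound has the shape

    \[
    \Delta \;\le\; \underbrace{L\,\big(W_1(P_{\mathrm{tr}}, P_{\mathrm{dep}}) + W_1(\rho_{\mathrm{tr}}, \rho_{\mathrm{dep}})\big)}_{\text{extrinsic: environment shift}}
    \;+\; \underbrace{c\,\widehat{\mathfrak{R}}_T(\mathcal{V})}_{\text{intrinsic: generalisability}},
    \]

where \(\Delta\) is the rational risk gap, \((P, \rho)\) denote the transition kernel and initial-state distribution in training and deployment, and \(\mathcal{V}\) is the value function class.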

Core claim

An action in deployment is perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, accumulated over the deployment trajectory, is defined to be the expected rational risk; an empirical average version in training is also defined. Their difference, termed the rational risk gap, decomposes into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. These are upper bounded by, respectively, (1) the 1-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class.
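A minimal formalisation of these two risks, every symbol being an assumption of this sketch rather than the paper's notation (\(V^*\) the hidden true value function, \(a^*_t\) its maximiser at \(s_t\), \(H\) the horizon, \(T\) the number of training episodes):

    \[
    \mathcal{R}_{\mathrm{dep}}(\pi) \;=\; \mathbb{E}\Big[\sum_{t=1}^{H}\big(V^*(s_t, a^*_t) - V^*(s_t, a_t)\big)\Big],
    \qquad
    \widehat{\mathcal{R}}_{\mathrm{tr}}(\pi) \;=\; \frac{1}{T}\sum_{i=1}^{T}\sum_{t=1}^{H}\big(V^*(s^i_t, a^{*,i}_t) - V^*(s^i_t, a^i_t)\big),
    \]

with the rational risk gap \(\Delta(\pi) = \mathcal{R}_{\mathrm{dep}}(\pi) - \widehat{\mathcal{R}}_{\mathrm{tr}}(\pi)\).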

What carries the argument

The rational risk gap, decomposed into extrinsic (environment-shift) and intrinsic (generalisability) components and bounded using 1-Wasserstein distance and empirical Rademacher complexity of the value function class.

If this is right

  • Regularisers including layer normalisation, ℓ₂ regularisation, and weight normalisation reduce the intrinsic component of the rational risk gap (a sketch follows this list).
  • Domain randomisation narrows the extrinsic component by bringing training and deployment distributions closer in Wasserstein distance.
  • Environment shifts between training and deployment increase the overall rational risk gap and produce less rational actions at test time.
  • The bounds supply a quantitative explanation for why the listed techniques improve or harm cross-environment rationality.
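To make the first point concrete, here is a minimal PyTorch sketch of a DQN-style value network wired with the three named regularisers; the architecture, sizes, and hyperparameters are illustrative assumptions of this review, not the paper's configuration (the paper's code at github.com/EVIEHub/Rationality is the authoritative source).

    import torch
    import torch.nn as nn

    class ValueNet(nn.Module):
        """DQN-style Q-network; illustrative only, not the paper's architecture."""
        def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.LayerNorm(hidden),  # layer normalisation (hypothesised to shrink the intrinsic gap)
                nn.ReLU(),
                nn.utils.weight_norm(nn.Linear(hidden, hidden)),  # weight normalisation
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, s: torch.Tensor) -> torch.Tensor:
            return self.net(s)

    # Taxi-v3 has 500 discrete states and 6 actions; a one-hot state encoding is assumed here.
    q = ValueNet(state_dim=500, n_actions=6)
    # l2 regularisation enters through the optimiser's weight decay.
    opt = torch.optim.Adam(q.parameters(), lr=1e-3, weight_decay=1e-4)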

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could use measured Wasserstein distances as a pre-deployment proxy for expected rationality loss without needing the hidden value function (see the sketch after this list).
  • The decomposition suggests new algorithm designs that explicitly penalise the intrinsic gap during training to improve deployment rationality.
  • Analogous rationality gaps and bounds could be derived for supervised learning under covariate shift using similar discrepancy measures.
  • Tighter analysis of the Rademacher term might yield practical regularisation schedules that directly target rationality preservation.
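On the first bullet, a minimal sketch assuming one can sample next-states from both environments; it applies SciPy's one-dimensional Wasserstein distance feature by feature, a crude stand-in for the W1 distance between full transition kernels that the bound actually involves. Function and variable names are hypothetical.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def shift_proxy(train_next_states: np.ndarray, deploy_next_states: np.ndarray) -> float:
        """Crude pre-deployment proxy: per-feature 1-D W1 distance, averaged.

        Editorial sketch, not the paper's estimator; the bound involves W1
        between transition kernels, which this feature-wise average only
        loosely approximates.
        """
        dims = train_next_states.shape[1]
        return float(np.mean([
            wasserstein_distance(train_next_states[:, d], deploy_next_states[:, d])
            for d in range(dims)
        ]))

    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, size=(1000, 4))   # hypothetical sampled next-states
    deploy = rng.normal(0.3, 1.2, size=(1000, 4))  # shifted deployment samples
    print(shift_proxy(train, deploy))  # larger value -> expect a larger extrinsic gap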

Load-bearing premise

A hidden true value function exists that defines perfectly rational actions, and the rational risk gap decomposes cleanly into extrinsic and intrinsic components with the stated upper bounds.

What would settle it

Experiments in which the rational risk gap repeatedly exceeds the sum of the 1-Wasserstein bound and the Rademacher complexity bound, even when a candidate true value function is supplied, would falsify the central claim.
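Stated as a check, assuming the gap and both bound terms can be estimated (all names hypothetical), the settling experiment reduces to a sketch like:

    def falsifies_bound(measured_gap: float, w1_bound: float,
                        rademacher_bound: float, tol: float = 0.0) -> bool:
        """Editorial sketch: True flags an exceedance of the claimed upper bound.

        Repeated exceedances across seeds and environments, with a credible
        candidate true value function supplied, would falsify the central
        claim; a single exceedance could be estimation noise.
        """
        return measured_gap > w1_bound + rademacher_bound + tol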

Figures

Figures reproduced from arXiv: 2602.04737 by Amos Storkey, Fengxiang He, Kejiang Qian.

Figure 1
Figure 1: Reward curves of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments. view at source ↗
Figure 2
Figure 2: Rational risk gap of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments. (Panels: Taxi and Cliff Walking; x-axis: episode; y-axis: rational risk gap; conditions: baseline, 10%, 30%, 50%, 70%.) view at source ↗
Figure 3
Figure 3: Rational risk gap of DQN across different environment levels in Taxi-v3 and Cliff Walking environments. We evaluate DQN under increasing challenge levels of training environments (0%, 10%, 30%, 50%, 70%), where the percentage gives the probability of action randomisation during training. view at source ↗
Figure 4
Figure 4: Reward curves of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments with challenge level of 25%. view at source ↗
Figure 5
Figure 5: Special case of rational risk gap of DQN under different regularisation and domain randomisation techniques in Taxi-v3 and Cliff Walking environments with challenge level of 25%. At the beginning of training, the policy is still incapable and thus causes frequent cliff falls (meaning a -100 penalty), so the terminal condition is triggered very quickly. Consequently, episodes have short horizons, making bot… view at source ↗
Figure 6
Figure 6: Special case of rational risk gap of DQN across different environment levels in Taxi-v3 and Cliff Walking environments. We evaluate DQN under increasing challenge levels of training environments (0%, 10%, 30%, 50%, 70%), where the percentage gives the probability of action randomisation during training. view at source ↗
read the original abstract

This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents, a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction. The expected value discrepancy of a policy's actions against their rational counterparts, culminating over the trajectory in deployment, is defined to be expected rational risk; an empirical average version in training is also defined. Their difference, termed as rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic one due to the algorithm's generalisability in a dynamic environment. They are upper bounded by, respectively, (1) the $1$-Wasserstein distance between transition kernels and initial state distributions in training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits from regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm from environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces rationality measures for RL agents, defining a perfectly rational action as one that maximizes a hidden true value function in the steepest direction. It defines expected rational risk (and its empirical training counterpart) as the value discrepancy between a policy's actions and their rational counterparts, then decomposes the rational risk gap between training and deployment into an extrinsic component (environment shift) and an intrinsic component (algorithm generalizability). These are upper-bounded by the 1-Wasserstein distance between transition kernels/initial distributions and the empirical Rademacher complexity of the value function class, respectively. The theory yields hypotheses on benefits of regularizers (layer norm, ℓ₂, weight norm) and domain randomization and harms of environment shifts; experiments are reported to confirm them, with code released.

Significance. If the decomposition and bounds are rigorously established without unaccounted interaction terms, the work supplies a concrete theoretical link between distribution shift, generalization, and rationality loss in RL, offering testable predictions for regularization and domain randomization that could inform safer deployment. The explicit code release and experimental agreement with the derived hypotheses are positive for reproducibility.

major comments (3)
  1. [§3] §3 (decomposition of rational risk gap): the claim of a clean additive split into extrinsic (Wasserstein-bounded) and intrinsic (Rademacher-bounded) terms requires an explicit expansion showing that cross terms between environment shift and value-function approximation error are identically zero; the hidden true value function must also be shown to satisfy the Lipschitz condition needed for the Wasserstein bound to apply directly.
  2. [Theorem 1] Theorem bounding the extrinsic component: the 1-Wasserstein distance is invoked on transition kernels and initial-state distributions, but the derivation must state the precise regularity (e.g., Lipschitz constant of the hidden value function w.r.t. the underlying metric) under which the bound holds; without this, the bound may be loose or inapplicable in general MDPs (the standard duality step is sketched after these comments).
  3. [§5] Experimental validation (§5): the reported agreement with hypotheses on regularizers is qualitative; quantitative reporting of measured rational risk gap versus the derived bounds (or at least confidence intervals on the gap) is needed to assess whether the theory is tight or merely directionally consistent.
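For context on the second major comment: the step the referee asks to see made explicit is Kantorovich–Rubinstein duality. For any function f that is L-Lipschitz with respect to the metric underlying W1, and any two distributions μ and ν,

    \[
    \Big|\,\mathbb{E}_{x\sim\mu}[f(x)] - \mathbb{E}_{x\sim\nu}[f(x)]\,\Big| \;\le\; L\,W_1(\mu,\nu),
    \]

so the extrinsic bound can only be applied where the hidden value function (or whatever composite of it is being integrated) satisfies such a Lipschitz condition.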
minor comments (2)
  1. [§2] Notation for the hidden value function and the steepest-direction maximizer should be introduced with a single consistent symbol set in §2 to avoid later ambiguity.
  2. [Abstract] The abstract states that experiments are 'in full agreement' with the hypotheses; a brief sentence clarifying whether this includes statistical significance tests would improve precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below. We will revise the paper to incorporate the suggested clarifications and additional reporting, which we believe will strengthen the presentation of the theoretical results and experimental validation.

read point-by-point responses
  1. Referee: [§3] §3 (decomposition of rational risk gap): the claim of a clean additive split into extrinsic (Wasserstein-bounded) and intrinsic (Rademacher-bounded) terms requires an explicit expansion showing that cross terms between environment shift and value-function approximation error are identically zero; the hidden true value function must also be shown to satisfy the Lipschitz condition needed for the Wasserstein bound to apply directly.

    Authors: We thank the referee for this observation. The decomposition is additive by construction because the rational risk gap is the difference between the deployment expected rational risk (under the true environment) and the training empirical counterpart; the extrinsic term isolates the effect of the shift in transition kernels and initial distributions, while the intrinsic term isolates the generalization error of the value-function class. Any cross terms vanish due to linearity of expectation when the expectations are taken separately over the shifted distribution versus the empirical one. We will add an explicit algebraic expansion in the revised Section 3 to demonstrate this cancellation. We will also explicitly state the assumption that the hidden true value function is Lipschitz continuous with respect to the underlying metric on the state space (a standard regularity condition for Wasserstein bounds in MDPs) and indicate how the Lipschitz constant enters the bound. revision: yes
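A schematic of the telescoping the response appeals to, in this review's assumed notation rather than the paper's (the actual pivot quantity may differ):

    \[
    \underbrace{\mathcal{R}_{\mathrm{dep}}(\pi) - \widehat{\mathcal{R}}_{\mathrm{tr}}(\pi)}_{\text{rational risk gap}}
    \;=\;
    \underbrace{\big(\mathcal{R}_{\mathrm{dep}}(\pi) - \mathcal{R}_{\mathrm{tr}}(\pi)\big)}_{\text{extrinsic: same policy, shifted environment}}
    \;+\;
    \underbrace{\big(\mathcal{R}_{\mathrm{tr}}(\pi) - \widehat{\mathcal{R}}_{\mathrm{tr}}(\pi)\big)}_{\text{intrinsic: population vs. empirical}},
    \]

where \(\mathcal{R}_{\mathrm{tr}}\) is the population risk under the training environment; inserting and subtracting this middle term makes the additivity exact by construction, with no cross term.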

  2. Referee: [Theorem 1] Theorem bounding the extrinsic component: the 1-Wasserstein distance is invoked on transition kernels and initial-state distributions, but the derivation must state the precise regularity (e.g., Lipschitz constant of the hidden value function w.r.t. the underlying metric) under which the bound holds; without this, the bound may be loose or inapplicable in general MDPs.

    Authors: We agree that the regularity conditions should be stated explicitly. The bound in Theorem 1 holds when the hidden value function is L-Lipschitz continuous for a finite constant L with respect to the metric underlying the 1-Wasserstein distance. We will revise the statement of Theorem 1 and its proof to include this assumption clearly, together with the dependence of the bound on L, so that the conditions of applicability are unambiguous. revision: yes

  3. Referee: [§5] Experimental validation (§5): the reported agreement with hypotheses on regularizers is qualitative; quantitative reporting of measured rational risk gap versus the derived bounds (or at least confidence intervals on the gap) is needed to assess whether the theory is tight or merely directionally consistent.

    Authors: We acknowledge that the current experimental section reports only qualitative agreement with the derived hypotheses. In the revised manuscript we will augment Section 5 with quantitative comparisons: we will plot the measured rational risk gap against the corresponding theoretical upper bounds and report confidence intervals for the gaps computed over multiple independent runs. This will allow readers to evaluate the tightness of the bounds in practice. revision: yes
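A minimal sketch of the promised reporting, assuming per-run measurements of the gap are available (names hypothetical; the authors may prefer a bootstrap):

    import numpy as np

    def gap_confidence_interval(gaps_per_run: np.ndarray, z: float = 1.96) -> tuple[float, float]:
        """Normal-approximation 95% CI for the mean rational risk gap over independent runs."""
        mean = float(gaps_per_run.mean())
        half = float(z * gaps_per_run.std(ddof=1) / np.sqrt(len(gaps_per_run)))
        return mean - half, mean + half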

Circularity Check

0 steps flagged

No significant circularity; bounds derived from external standard tools

full rationale

The paper defines a hidden true value function to identify perfectly rational actions, then defines expected rational risk as the value discrepancy and the rational risk gap as its training-deployment difference. This gap is decomposed into extrinsic (environment shift) and intrinsic (generalization) components, with explicit upper bounds given by the 1-Wasserstein distance on transition kernels/initial distributions and the empirical Rademacher complexity of the value-function class. Both bounding quantities are drawn from established external results in optimal transport and statistical learning theory; they are not obtained by fitting parameters inside the paper and renaming the fit as a prediction, nor by self-citation chains that presuppose the target claim. The derivation therefore remains self-contained against independent mathematical benchmarks and does not reduce any load-bearing step to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption of a hidden true value function and standard mathematical inequalities; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: There exists a hidden true value function that defines perfect rationality for actions.
    Invoked to define the rational counterpart actions and the value discrepancy.

pith-pipeline@v0.9.0 · 5519 in / 1270 out tokens · 58291 ms · 2026-05-16T07:16:56.266551+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    The rational risk gap is decomposed into (1) an extrinsic component caused by environment shifts... upper bounded by... the 1-Wasserstein distance... and (2) an intrinsic one... empirical Rademacher complexity of the value function class.

  • Foundation.RealityFromDistinction reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We define an action in deployment to be perfectly rational if it maximises the hidden true value function in the steepest direction.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

    Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

  2. [2]

    Bukharin, A., Li, Y., Yu, Y., Zhang, Q., Chen, Z., Zuo, S., Zhang, C., Zhang, S., and Zhao, T. Robust multi-agent reinforcement learning via adversarial regularization: Theoretical foundation and stable algorithms. In Advances in Neural Information Processing Systems, volume 36, pp. 68121–68133, 2023.

  3. [3]

    Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2048–2056, 2020.

  4. [4]

    Da, L., Turnau, J., Kutralingam, T. P., Velasquez, A., Shakarian, P., and Wei, H. A survey of sim-to-real methods in RL: Progress, prospects and challenges with foundation models. arXiv preprint arXiv:2502.13187, 2025.

  5. [5]

    Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

  6. [6]

    Evans, B. P., Ardon, L., and Ganesh, S. Modelling bounded rational decision-making through Wasserstein constraints. arXiv preprint arXiv:2504.03743, 2025.

  7. [7]

    Fishburn, P. C. Subjective expected utility: A review of normative theories. Theory and Decision, 13(2):139–199, 1981.

  8. [8]

    Gottesman, O., Asadi, K., Allen, C., Lobel, S., Konidaris, G., and Littman, M. Coarse-grained smoothness for reinforcement learning in metric spaces. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 1390–1410, 2023.

  9. [9]

    Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

  10. [10]

    Kantorovich, L. V. Mathematical methods of organizing and planning production. Management Science, 6:366–422, 1960.

  11. [11]

    Liu, F., Viano, L., and Cevher, V. Understanding deep neural function approximation in reinforcement learning via ϵ-greedy exploration. Advances in Neural Information Processing Systems, 35:5093–5108, 2022.

  12. [12]

    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

  13. [13]

    Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.

  14. [14]

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

    Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

  15. [15]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  16. [16]

    Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free methods. In Conference on Learning Theory (COLT), 2019.

  17. [17]

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, 2017.

  18. [18]

    Valiant, L. G. Rationality. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pp. 3–14, 1995.

  19. [19]
  20. [20]

    Wang, H., Zheng, S., Xiong, C., and Socher, R. On the generalization gap in reparameterizable reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 6648–6658, 2019.

  21. [21]

    Internal anchor (appendix proof text rather than an external work): a policy drift bound. At time step \(h \in [H]\) over \(T\) episodes,

    \[
    \sup_{\pi\in\Pi}\Big|\,\mathbb{E}_{s_h\sim D^{\hat{\pi}}_h}\big[Q^*_h(s_h, a^{\pi}_h)\big] \;-\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{s^t_h\sim D^{\pi_t}_h}\big[Q^*_h(s^t_h, a^{\pi}_h)\big]\Big| \;\le\; L_{\Pi}\,\sqrt{2\log|A|},
    \]

    which measures the discrepancy between the state distribution induced by the fixed policy \(\hat{\pi}\) and the state distributions induced by the per-episode policies \(\pi_t\).
    At time steph∈[H]overTepisodes, we have this policy drift bound, sup π∈Π Esh∼Dˆπ h Q∗ h(sh, aπ h)− 1 T TX t=1 Est h∼Dπt h Q∗ h(st h, aπ h) ≤L Π p 2 log|A|. Proof. This term, supπ∈Π Esh∼Dˆπ h Q∗ h(sh, aπ h)− 1 T PT t=1 Est h∼Dπt h Q∗ h(st h, aπ h) , measures the discrepancy between the state distribution Dˆπ h induced by the fixed policy ˆπand the state di...