When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

Peiying Zhu; Sidi Chang

arxiv: 2605.18580 · v1 · pith:UEFLSXGTnew · submitted 2026-05-18 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-20 10:46 UTC · model grok-4.3

keywords: discipline stability · trace-based evaluation · hidden competitor state · behavioral discipline · revenue management · reinforcement learning · agent safety · pricing policies

The pith

Outcome metrics like revenue can approve AI pricing agents that violate the behavioral discipline of rule-based competitors under hidden state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that outcome-only evaluation can certify agents that meet business KPIs while violating behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. Discipline stability addresses this with a trace-based procedure: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, and separate mechanisms with ablations. In the reported results, reward-only PPO variants miss trace alignment, while trace-prior or corrected-history policies better preserve price or bid distributions.

Core claim

What carries the argument

Discipline stability, a trace-based evaluation paradigm that defines benchmark behavior from rule-based competitors, restricts observations to the deployment regime, induces trace diagnostics from failure, separates mechanisms with ablations, and tests transfer and deployment.

If this is right

Reward-only PPO variants miss trace alignment with benchmark behaviors under hidden competitor state.
Revealing hidden state reduces label uncertainty in behavioral compliance checks.
Trace-prior or corrected-history policies better preserve price and bid distributions.
Pure behavior cloning suffices for symmetric imitation while trace-prior RL enables bounded adaptation under capacity asymmetry.
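
The label-uncertainty point above can be illustrated with a toy calculation. The sketch below is not the paper's benchmark: the records, the inventory values, and the helper `conditional_entropy` are hypothetical, chosen only to show how conditioning on a hidden competitor inventory can remove ambiguity about the benchmark price.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(samples):
    """Empirical H(label | context) in bits from (context, label) pairs."""
    by_context = defaultdict(Counter)
    for context, label in samples:
        by_context[context][label] += 1
    total = len(samples)
    entropy = 0.0
    for counts in by_context.values():
        n = sum(counts.values())
        for c in counts.values():
            p = c / n
            # Weight each context by how often it occurs.
            entropy -= (n / total) * p * math.log2(p)
    return entropy


# Hypothetical toy records (assumption, not data from the paper):
# (visible Hotel A inventory, hidden Hotel B inventory, benchmark market price)
records = [
    (5, 2, 120), (5, 2, 120),
    (5, 8, 90), (5, 8, 90),
    (3, 2, 140), (3, 8, 110),
]

# Deployment regime: only Hotel A's own inventory is observable.
visible_only = [(qa, price) for qa, qb, price in records]
# Oracle regime: the hidden Hotel B inventory is revealed as well.
with_oracle = [((qa, qb), price) for qa, qb, price in records]

h_visible = conditional_entropy(visible_only)
h_oracle = conditional_entropy(with_oracle)
print(f"H(price | visible state)      = {h_visible:.3f} bits")
print(f"H(price | visible + hidden B) = {h_oracle:.3f} bits")
```

In this toy data every visible state maps to two equally likely prices, so the visible-only entropy is positive, while the oracle context pins down the price exactly. Real traces would sit somewhere between these extremes.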

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Incorporating trace checks during training could prevent post-hoc discipline failures.
The method may apply to other competitive domains with partial observability like auctions or logistics.
Future benchmarks should prioritize deployment-regime simulation to surface hidden violations early.

Load-bearing premise

Defining benchmark behaviors from rule-based competitors and using trace diagnostics restricted to the deployment regime will reliably detect behavioral violations that outcome metrics overlook.

What would settle it

Demonstrating a policy that reaches revenue targets while fully replicating the rule-based competitor's price discipline and trace behavior under hidden states would falsify the need for separate trace evaluation.

Figures

Figures reproduced from arXiv: 2605.18580 by Peiying Zhu, Sidi Chang.

**Figure 1.** Hidden-state aliasing mechanism. The same visible Hotel A state can correspond to multiple hidden Hotel B inventories and therefore multiple valid market prices.

… action-label uncertainty. Revealing oracle 𝑞𝐵 sharply improves market-price prediction. 3. Trace learning is the repair signal. In the symmetric market, BC-only stochastic copy is nearly enough; in a capacity-asymmetric variant, a full-distributio…

Original abstract

Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Outcome metrics can accept RL policies that hit revenue targets but break rate discipline in hidden-competitor settings, and the paper supplies a concrete trace-based evaluation method to catch the mismatch.

The main point is that a learned policy can post plausible RevPAR while deviating from the rate discipline a rule-based revenue manager would keep, especially when competitor state stays hidden. The paper shows this on hotel pricing and a compact bidding task and gives a way to detect it through traces rather than final scores alone.

Discipline stability is the label they put on the approach: pick a benchmark behavior from the rule-based competitor, look only at the deployment regime, pull diagnostics from the places where the learned policy diverges, run ablations to separate causes, and check transfer. On the two-hotel benchmark, reward-only PPO misses the trace alignment; giving the agent the hidden state cuts label uncertainty; deterministic copying collapses the uncertainty; and trace-prior or corrected-history versions stay closer to the benchmark distributions. Pure behavior cloning is already strong when the two sides are symmetric, while trace-prior RL adds limited adaptation when capacity differs.

They are explicit that this is an evaluation paradigm, not a new optimizer or a broad claim about MARL. That framing keeps the contribution focused and avoids overreach. The work is useful because it turns a practical deployment worry into a repeatable checklist with concrete tasks attached. The central observation holds: outcome numbers alone do not guarantee behavioral safety in these settings.

The soft spot is that the reported results stay mostly qualitative. The abstract and summary describe the patterns but do not include the actual tables, effect sizes, or error bars that would let a reader judge how large or stable the gaps are. The assumption that the rule-based trace is the right reference point is reasonable for the paper's purpose, yet it could shift if the benchmark itself changes.

Readers who build or evaluate RL for pricing, bidding, or similar economic interactions with partial observability will find the benchmarks and the step-by-step method directly usable. It is worth sending to peer review because the problem is real, the method is straightforward to apply, and the initial evidence is consistent even if more quantitative detail would make the case stronger.

Referee Report

2 major / 2 minor

Summary. The paper claims that outcome-only evaluation in RL can certify unsafe agents in settings like hotel pricing with hidden competitor state, where a policy achieves good RevPAR but violates rate discipline of rule-based competitors. It introduces 'discipline stability' as a trace-based evaluation paradigm consisting of defining benchmark behavior, restricting to deployment regime, inducing trace diagnostics from failure, separating mechanisms with ablations, and testing transfer and deployment. On a two-hotel benchmark and bidding task, it finds that reward-only PPO misses alignment, hidden state revelation reduces uncertainty, and trace-prior or corrected history policies better preserve distributions. The contribution is positioned as an evaluation paradigm rather than a new optimizer.

Significance. If the findings are substantiated, the work matters for developing safer RL agents in economic domains: it highlights the insufficiency of outcome metrics alone and offers a structured trace-based approach to detecting behavioral violations. The use of ablations and transfer tests for mechanism separation deserves credit, as does the clear scoping as an evaluation framework. This could help identify when good outcomes mask discipline failures in competitive settings.

major comments (2)

[Experimental Results] The claims regarding the performance of different PPO variants and the benefits of revealing hidden state are presented qualitatively without quantitative data, tables, error bars, or specific ablation results. This undermines the ability to assess the magnitude of the reported effects and the reliability of the separation between mechanisms.
[Discipline Stability Paradigm] While the paradigm is outlined, there is no detailed formalization or pseudocode for how 'trace diagnostics from failure' are induced or how the restriction to the deployment regime is implemented, which is load-bearing for the central claim that this paradigm reliably identifies violations missed by outcome metrics.

minor comments (2)

[Abstract] The abstract is dense; breaking the description of the paradigm into bullet points or shorter sentences would improve readability.
[Related Work] Expand on connections to imitation learning and behavioral cloning to better position the trace-prior RL approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The feedback identifies clear opportunities to strengthen the quantitative evidence and formal details supporting our evaluation paradigm. We address each major comment below and will revise the manuscript to incorporate the suggested improvements while preserving the paper's focus as an evaluation framework rather than a new optimizer.

Referee: [Experimental Results] The claims regarding the performance of different PPO variants and the benefits of revealing hidden state are presented qualitatively without quantitative data, tables, error bars, or specific ablation results. This undermines the ability to assess the magnitude of the reported effects and the reliability of the separation between mechanisms.

Authors: We agree that the current manuscript presents the differences in trace alignment, uncertainty reduction, and distribution preservation primarily through qualitative descriptions. In the revised version we will add quantitative tables reporting mean KL-divergence between learned and benchmark price/bid distributions, RevPAR values, and discipline-violation counts, each with standard errors computed over multiple random seeds. Figures will include error bars, and a dedicated ablation table will quantify effect sizes for hidden-state revelation versus trace-prior or corrected-history policies, enabling readers to evaluate both magnitude and reliability of the reported separations.

revision: yes

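
As a rough illustration of the reporting the authors describe, the sketch below computes a mean and standard error over per-seed values. The function name `mean_and_standard_error` and the per-seed numbers are placeholders assumed for the example; they are not results from the paper.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for per-seed measurements."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Placeholder per-seed KL values (assumption, NOT numbers from the paper).
kl_per_seed = [0.10, 0.12, 0.08, 0.10]

kl_mean, kl_se = mean_and_standard_error(kl_per_seed)
print(f"KL to benchmark: {kl_mean:.3f} +/- {kl_se:.3f} (n={len(kl_per_seed)} seeds)")
```

The same helper could be applied to RevPAR or violation counts, one list of per-seed values per table cell.
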
Referee: [Discipline Stability Paradigm] While the paradigm is outlined, there is no detailed formalization or pseudocode for how 'trace diagnostics from failure' are induced or how the restriction to the deployment regime is implemented, which is load-bearing for the central claim that this paradigm reliably identifies violations missed by outcome metrics.

Authors: We accept that greater formal precision is required. The revision will include a new subsection with mathematical definitions: the deployment regime is formalized as an observation mask projecting the full state onto the agent's observable variables; trace diagnostics are defined via a distance metric (e.g., total variation or KL) between the agent's action trace and the benchmark rule-based trace, with failure declared when the distance exceeds a pre-specified discipline threshold. We will also supply pseudocode for the complete pipeline (benchmark definition, regime restriction, diagnostic induction, ablation separation, and transfer testing) to make the procedure reproducible and to reinforce the claim that outcome metrics alone can miss behavioral violations.

revision: yes
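
The definitions promised in this response can be sketched in a few lines. This is one possible reading under stated assumptions, not the authors' formalization: the function names, the state keys `q_a` and `q_b`, the choice to threshold on total variation, the KL direction (agent relative to benchmark), and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def observation_mask(full_state, visible_keys):
    """Project the full state onto the variables the agent can observe."""
    return {key: full_state[key] for key in visible_keys}


def empirical_distribution(trace, support):
    """Relative frequency of each action in `support` within `trace`."""
    counts = Counter(trace)
    return [counts[action] / len(trace) for action in support]


def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is floored at eps so unseen actions give a large finite value."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def discipline_check(agent_trace, benchmark_trace, threshold):
    """Compare action distributions and flag failure when TV exceeds the threshold."""
    support = sorted(set(agent_trace) | set(benchmark_trace))
    p = empirical_distribution(agent_trace, support)
    q = empirical_distribution(benchmark_trace, support)
    tv = total_variation(p, q)
    return {"tv": tv, "kl": kl_divergence(p, q), "fails": tv > threshold}


# Deployment regime: the agent sees its own inventory but not the competitor's.
observed = observation_mask({"q_a": 5, "q_b": 2}, visible_keys=["q_a"])

# Hypothetical price traces with the same mean price (110) but different shapes.
benchmark_trace = [100, 100, 120, 120]
agent_trace = [110, 110, 110, 110]

result = discipline_check(agent_trace, benchmark_trace, threshold=0.1)
print(observed, result)
```

The two toy traces share a mean price, a crude stand-in for an outcome metric, yet have disjoint price supports, so the trace check fails even though the outcome proxy matches.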

Circularity Check

0 steps flagged

No significant circularity

The paper introduces a trace-based evaluation paradigm for assessing agent discipline under hidden competitor states in hotel pricing and bidding tasks. It defines benchmarks from rule-based competitors, restricts to deployment regimes, and reports experimental outcomes from ablations, transfer tests, and variants like reward-only PPO versus trace-prior RL. No load-bearing step reduces by construction to fitted parameters, self-definitions, or self-citation chains; the contribution is explicitly framed as an evaluation framework rather than a derived optimizer or universal claim, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central contribution rests on domain assumptions about benchmark behaviors and the utility of trace diagnostics rather than free parameters or new physical entities. No explicit fitted values or invented particles are described.

axioms (1)

domain assumption: Benchmark behavior from rule-based revenue management systems provides a reliable reference for deployable discipline.
The paradigm begins by defining the benchmark behavior from existing rule-based competitors.

invented entities (1)

discipline stability (no independent evidence)
purpose: Trace-based evaluation paradigm to detect behavioral violations missed by outcome metrics.
New evaluation concept introduced to organize trace diagnostics and ablations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
