Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
Pith reviewed 2026-05-08 09:40 UTC · model grok-4.3
The pith
Trace-Prior RL aligns pricing agents with hidden competitor states by learning a distributional prior from traces and adding a KL penalty to the reward.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Outcome metrics can certify the wrong behavior. In the simulator, Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same visible state maps to multiple plausible competitor prices. Deterministic RL collapses this uncertainty into shortcut behaviors, such as selling too aggressively or using only modal prices, even when RevPAR matches the reference. Trace-Prior RL learns a distributional market prior from lagged market traces, then trains a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty while still optimizing Hotel A's own reward.
What carries the argument
The trace-prior regularization mechanism: a distributional prior over market pricing learned from lagged traces, incorporated via a KL penalty term in the policy's training objective to enforce trace-level alignment beyond scalar rewards.
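In code, the mechanism is a single extra term in the policy objective. A minimal PyTorch-style sketch, assuming a categorical policy over discrete price buckets and a frozen prior network fitted beforehand to lagged traces (the names, shapes, and REINFORCE-style estimator are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def trace_prior_policy_loss(policy_logits, prior_logits, actions, advantages, beta=0.5):
    """Policy-gradient loss with a KL penalty to a learned market prior.

    policy_logits: (B, K) logits of the stochastic pricing policy over K price buckets
    prior_logits:  (B, K) logits of the trace prior p_hat(price | visible state), frozen
    actions:       (B,) sampled price-bucket indices
    advantages:    (B,) RevPAR-based advantage estimates
    beta:          KL penalty coefficient (a free hyperparameter of the method)
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    # RevPAR term: raise the log-probability of high-advantage price actions.
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * chosen).mean()
    # KL(pi || prior): keep the policy's price distribution close to market traces.
    log_prior = F.log_softmax(prior_logits.detach(), dim=-1)
    kl = (log_pi.exp() * (log_pi - log_prior)).sum(dim=-1).mean()
    return pg_loss + beta * kl
```

The prior is detached, so gradients flow only through the policy; setting beta to zero recovers the unregularized baseline discussed in the referee exchange below.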
Load-bearing premise
Lagged market traces collected in the simulator provide an unbiased and sufficient sample of the hidden competitor's state-dependent pricing distribution, and the KL penalty weight can be chosen without trading away too much of the agent's own revenue objective.
What would settle it
If the Trace-Prior RL policy in the simulator produces price distributions whose L1 or JS distance to Hotel B's distribution exceeds the reported seed-level confidence intervals while its RevPAR remains at or above the standard agent's level, that would falsify the claim that the method achieves alignment without sacrificing revenue.
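Both distances in this test are standard and cheap to compute. A self-contained numpy sketch over shared price buckets (the bucket edges and synthetic prices are illustrative only):

```python
import numpy as np

def bucket_distribution(prices, edges):
    """Empirical price distribution over fixed, shared buckets."""
    counts, _ = np.histogram(prices, bins=edges)
    return counts / counts.sum()

def l1_distance(p, q):
    return np.abs(p - q).sum()

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log2(a / b)).sum()
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Illustrative comparison of agent prices against Hotel B's reference prices.
rng = np.random.default_rng(0)
edges = np.linspace(50, 350, 31)  # 30 shared price buckets (hypothetical)
p = bucket_distribution(rng.normal(180, 40, 5000), edges)
q = bucket_distribution(rng.normal(185, 42, 5000), edges)
print(f"L1={l1_distance(p, q):.3f}  JS={js_distance(p, q):.3f}")
```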
Original abstract
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies Goodhart-style failures in RL pricing agents under partial observability in a two-hotel revenue-management simulator. Standard agents achieve near-reference RevPAR but produce non-market-like traces (aggressive selling, undercutting, modal price collapse). The authors diagnose this as arising because visible states map to multiple plausible competitor prices. They introduce trace-level diagnostics (RevPAR, occupancy, ADR, price-bucket distributions, L1/JS distances, seed-level CIs) and propose Trace-Prior RL: learn a distributional market prior from lagged traces, then optimize a stochastic policy with RevPAR reward plus KL penalty to the prior. The central claim is that the resulting policy matches the competitor's aggregate traces within seed-level uncertainty while still optimizing the agent's own objective. The contribution is framed as a reproducible failure-and-repair recipe rather than a new optimizer or leaderboard.
Significance. If the empirical result holds under rigorous verification, the work supplies a concrete, reproducible protocol for aligning agent behavior with intended distributional traces when scalar rewards are easy to game. The emphasis on trace diagnostics over outcome metrics alone, and the explicit finding that higher exact-action accuracy can degrade aggregate alignment, are useful for agentic systems in partially observable market settings. The approach is grounded in standard RL machinery (KL-regularized policy optimization) and does not claim a new optimizer.
major comments (3)
- [Abstract, §4] Abstract and §4 (experimental results): The claim that the final Trace-Prior RL policy 'matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty' is not accompanied by reported details on experimental design (number of random seeds, exact statistical tests for 'within uncertainty', baseline comparisons, or data-exclusion rules). Without these, the support for the central empirical claim cannot be verified from the text.
- [§3.2, §4] §3.2 (Trace-Prior RL definition) and §4: The method assumes lagged market traces collected under the simulator's trajectory distribution yield an unbiased and sufficient sample of the hidden competitor's state-dependent price distribution p_B(·|s_B). Because the same visible state s_A can map to multiple s_B (and therefore multiple prices), any finite lagged-trace sample can leave residual mismatch in support or mass. The manuscript does not report a direct diagnostic (e.g., JS divergence between the learned prior and the ground-truth conditional distribution under hidden inventory/booking-curve states) that would confirm the prior is faithful rather than merely correlated with observed traces; a sketch of such a diagnostic appears after this list.
- [§3.2, §4] §3.2 and §4: The KL penalty coefficient is listed as a free hyperparameter. The paper should report sensitivity analysis showing that the chosen weight does not measurably reduce Hotel A's own RevPAR objective relative to an unregularized baseline; otherwise the 'still optimizing Hotel A's own reward' part of the claim is not fully substantiated.
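For the second comment, such a faithfulness diagnostic is easy to state when the simulator exposes hidden state post hoc. A minimal sketch, assuming logged tuples of (visible state key, hidden state key, Hotel B price bucket) and a learned prior queried per visible state; every name here is hypothetical:

```python
import numpy as np
from collections import defaultdict

def prior_faithfulness_js(logs, prior_fn, n_buckets, eps=1e-12):
    """Mean JS divergence between the learned prior p_hat(price | s_A) and the
    empirical conditional p_B(price | s_B) recovered from full simulator traces.

    logs:     iterable of (s_A key, s_B key, Hotel B price-bucket index)
    prior_fn: maps a visible-state key to a length-n_buckets probability vector
    """
    # Group Hotel B prices by (visible, hidden) state pair to estimate p_B(. | s_B).
    by_pair = defaultdict(lambda: np.zeros(n_buckets))
    for s_a, s_b, bucket in logs:
        by_pair[(s_a, s_b)][bucket] += 1

    def js(p, q):
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: (a * np.log2(a / b)).sum()
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    scores = [js(prior_fn(s_a), counts / counts.sum())
              for (s_a, s_b), counts in by_pair.items()]
    return float(np.mean(scores))
```

A low mean value supports a faithful prior; a high value indicates the prior is merely correlated with observed traces, which is exactly the failure mode the comment flags.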
minor comments (2)
- [§3] Notation for the learned prior p_B(·|s_A) versus the true conditional p_B(·|s_B) should be introduced explicitly in §3.1 or §3.2 to avoid reader confusion about what is being approximated.
- [Figures in §4] Figure captions for the price-bucket histograms and trace-alignment plots should state the number of seeds and whether error bars are standard deviation or standard error.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments have prompted us to strengthen the experimental reporting, add validation diagnostics, and include sensitivity results. We address each major comment below and have revised the manuscript accordingly.
Point-by-point responses
Referee: [Abstract, §4] Abstract and §4 (experimental results): The claim that the final Trace-Prior RL policy 'matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty' is not accompanied by reported details on experimental design (number of random seeds, exact statistical tests for 'within uncertainty', baseline comparisons, or data-exclusion rules). Without these, the support for the central empirical claim cannot be verified from the text.
Authors: We agree that the experimental design details were insufficiently specified, limiting independent verification. In the revised manuscript we have added a dedicated 'Experimental Setup' subsection in §4 that reports: 10 independent random seeds for every method and ablation; use of seed-level 95% confidence intervals (non-overlapping intervals indicate statistically distinguishable performance) to operationalize 'within seed-level uncertainty'; the complete baseline suite (standard RL, deterministic copying, behavior cloning, and ablations); and an explicit statement that no seeds or episodes were excluded. We have also documented the bootstrap procedure used for all distribution distances (L1 and JS); a minimal sketch of this style of bootstrap appears after these responses. These additions make the central empirical claim fully verifiable from the text. Revision: yes.
Referee: [§3.2, §4] §3.2 (Trace-Prior RL definition) and §4: The method assumes lagged market traces collected under the simulator's trajectory distribution yield an unbiased and sufficient sample of the hidden competitor's state-dependent price distribution p_B(·|s_B). Because the same visible state s_A can map to multiple s_B (and therefore multiple prices), any finite lagged-trace sample can leave residual mismatch in support or mass. The manuscript does not report a direct diagnostic (e.g., JS divergence between the learned prior and the ground-truth conditional distribution under hidden inventory/booking-curve states) that would confirm the prior is faithful rather than merely correlated with observed traces.
Authors: We thank the referee for identifying this validation gap. Our method is deliberately restricted to observable lagged traces so that it remains applicable when hidden competitor states are unavailable. Nevertheless, because the simulator provides post-hoc access to hidden states, we have added a new diagnostic subsection in §4.3 that computes the JS divergence between the learned prior (conditioned only on visible s_A) and the empirical conditional distribution obtained by grouping full simulator traces according to the corresponding hidden s_B. The reported divergences are low, confirming that the prior recovers the essential mass of the competitor's pricing behavior; any residual mismatch is discussed in terms of finite-sample effects and state aliasing. The KL-regularized policy then further mitigates the practical impact of such residuals, as evidenced by the final trace-alignment results. Revision: yes.
Referee: [§3.2, §4] §3.2 and §4: The KL penalty coefficient is listed as a free hyperparameter. The paper should report sensitivity analysis showing that the chosen weight does not measurably reduce Hotel A's own RevPAR objective relative to an unregularized baseline; otherwise the 'still optimizing Hotel A's own reward' part of the claim is not fully substantiated.
Authors: We agree that explicit sensitivity evidence is required to substantiate the claim that the agent's own reward remains optimized. In the revised §4 we now include a sensitivity table and accompanying analysis for KL coefficients in {0.0, 0.1, 0.5, 1.0, 2.0}. For the selected coefficient of 0.5 the agent's RevPAR lies within the seed-level confidence interval of the unregularized (KL = 0) baseline (difference < 2%), while trace-alignment metrics improve substantially. Larger coefficients begin to degrade RevPAR, which is now documented. This analysis confirms that the chosen weight preserves optimization of Hotel A's primary objective; a sketch of such a sweep follows below. Revision: yes.
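For concreteness, the seed-level bootstrap described in the first response can be sketched in a few lines of numpy (the 10-seed setup follows the revision; the per-seed values are invented for illustration):

```python
import numpy as np

def bootstrap_ci(per_seed_values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over seed-level metric values (RevPAR, L1, JS, ...)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_seed_values, dtype=float)
    # Resample seeds with replacement and take the mean of each resample.
    resamples = values[rng.integers(0, len(values), size=(n_boot, len(values)))]
    lo, hi = np.quantile(resamples.mean(axis=1), [alpha / 2, 1 - alpha / 2])
    return values.mean(), (lo, hi)

# Invented per-seed RevPAR values for one method across 10 seeds.
mean, (lo, hi) = bootstrap_ci([101.2, 99.8, 100.5, 102.0, 98.9,
                               100.1, 101.7, 99.4, 100.9, 100.3])
print(f"RevPAR mean={mean:.1f}, 95% CI=({lo:.1f}, {hi:.1f})")
```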
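The sensitivity sweep in the third response likewise reduces to a grid over the KL coefficient. A sketch assuming a hypothetical `train_and_evaluate(beta, seed)` that trains one policy and returns (RevPAR, JS distance to Hotel B):

```python
import numpy as np

BETAS = [0.0, 0.1, 0.5, 1.0, 2.0]  # the grid reported in the revised Section 4
SEEDS = range(10)

def kl_sensitivity_sweep(train_and_evaluate):
    """Seed-averaged RevPAR and trace alignment for each KL penalty coefficient."""
    rows = []
    for beta in BETAS:
        results = np.array([train_and_evaluate(beta, s) for s in SEEDS])
        revpar, js = results.mean(axis=0)
        rows.append((beta, revpar, js))
        print(f"beta={beta:<4} RevPAR={revpar:8.2f}  JS-to-B={js:.3f}")
    return rows
```

Since beta = 0 recovers the unregularized baseline, the resulting table directly exposes any revenue traded away for alignment.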
Circularity Check
No significant circularity detected
Full rationale
The paper's derivation introduces Trace-Prior RL as a method that learns a distributional prior from lagged competitor traces (external to the agent's parameters) and applies it as a KL regularizer alongside an independent RevPAR reward objective. The reported trace alignment (RevPAR, occupancy, ADR, price distribution) is presented as an empirical verification outcome under seed-level uncertainty rather than a quantity forced by construction from the inputs. No equations or steps reduce the central claim to a self-definition, renamed fit, or self-citation chain; the pipeline remains dependent on simulator data but does not exhibit the enumerated circular patterns. The approach is self-contained against the described benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- KL penalty coefficient
- Trace lag length
axioms (2)
- Domain assumption: The two-hotel simulator faithfully captures real-world revenue-management dynamics and competitor behavior.
- Domain assumption: Lagged market traces contain sufficient information to reconstruct the hidden competitor state distribution.
invented entities (1)
- Trace prior (no independent evidence)
Reference graph
Works this paper leans on
- [1] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b
- [2] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2052–2062, 2019. URL https://proceedings.mlr.press/v97/fujimoto19a.html
- [3] Charles A. E. Goodhart. Problems of monetary management: The U.K. experience. Papers in Monetary Economics, 1(1):1–20, 1975.
- [4] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. NeurIPS Conversational AI Workshop, 2019. URL https://arxiv.org/abs/1907.00456
- [5] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998. doi: 10.1016/S0004-3702(98)00023-X. URL https://www.sciencedirect.com/science/article/pii/S000437029800023X
- [6] Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, and Shane Legg. Specification gaming: The flip side of AI ingenuity. Google DeepMind Blog, 2020. URL https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
- [7] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html
- [8] Daniel McFadden. Conditional logit analysis of qualitative choice behavior. In Paul Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, New York, 1974.
- [9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236
- [10] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. In Deep Reinforcement Learning Workshop at NeurIPS, 2020. URL https://arxiv.org/abs/2006.09359
- [11] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JYtwGwIL7ye
- [12] Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. In Advances in Neural Information Processing Systems, volume 35, pages 9460–9471, 2022.
- [13] Marilyn Strathern. Improving ratings: Audit in the British university system. European Review, 5(3):305–321, 1997.
- [14] Kalyan T. Talluri and Garrett J. Van Ryzin. The Theory and Practice of Revenue Management. Springer, New York, 2004. doi: 10.1007/b139000. URL https://link.springer.com/book/10.1007/b139000
- [15] Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, pages 4496–4506, 2017.
- [16] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJg9hTNKPH