
arxiv: 2605.24878 · v1 · pith:QH7ZQTP3new · submitted 2026-05-24 · 💱 q-fin.TR · q-fin.MF

Entropy-Regularized Certainty-Equivalent Bellman Policies for Risk-Sensitive Market Making

Pith reviewed 2026-06-29 23:17 UTC · model grok-4.3

classification 💱 q-fin.TR q-fin.MF
keywords risk-sensitive market making · entropy regularization · Bellman operator · certainty equivalent · exponential utility · Gibbs policies · Hamiltonian · discretization error

The pith

An entropy-regularized Bellman operator applies log-sum-exp directly to certainty-equivalent scores for risk-sensitive market making and converges to the continuous-time value at rate O(h + λ(1 + |log λ|)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a discrete approximation scheme for a dealer who sets bid and ask quotes to maximize an exponential-utility certainty equivalent while facing Brownian midprice risk and Poisson order arrivals whose intensities depend on the posted quotes. It constructs an exact discrete entropy-regularized Bellman operator that regularizes the certainty-equivalent values of deterministic quote choices rather than a risk-neutral reward, because the exponential certainty equivalent does not commute with randomization over quotes. The work proves that the discrete values converge uniformly to the unregularized continuous-time risk-sensitive value as the time step h and the entropy parameter λ both tend to zero. It further derives performance bounds for the induced Gibbs policies under a fresh-sampling implementation and shows that these policies concentrate near the optimal quote set when the Hamiltonian satisfies a quadratic growth condition. Numerical tests in an Avellaneda-Stoikov model confirm the predicted scaling of discretization error, entropy bias, and policy concentration.
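The non-commutativity can be checked on a toy lottery. The sketch below is illustrative and not taken from the paper: it treats two deterministic quote choices as if each delivered a sure payoff equal to its certainty-equivalent score, randomizes between them, and compares the exponential certainty equivalent of the mixture with the probability-weighted average of the scores. The risk-aversion value `gamma`, the scores, and the helper name `certainty_equivalent` are assumptions chosen for the example.

```python
import math

def certainty_equivalent(outcomes, probs, gamma):
    """Exponential-utility certainty equivalent -(1/gamma) * log E[exp(-gamma * X)]."""
    expected_disutility = sum(p * math.exp(-gamma * x) for x, p in zip(outcomes, probs))
    return -math.log(expected_disutility) / gamma

gamma = 2.0             # hypothetical risk aversion
ce_scores = [0.0, 1.0]  # hypothetical scores of two deterministic quote choices
mix = [0.5, 0.5]        # randomization over the two choices

# Certainty equivalent of the randomized choice (expectation taken inside the log).
ce_of_mixture = certainty_equivalent(ce_scores, mix, gamma)
# Naive alternative: average the deterministic scores with the mixing weights.
mixture_of_ce = sum(p * s for p, s in zip(mix, ce_scores))

print(f"CE of mixture : {ce_of_mixture:.4f}")
print(f"mixture of CEs: {mixture_of_ce:.4f}")
```

By Jensen's inequality the first number is never larger than the second, and the two agree only when the scores coincide or the mixture is degenerate. Averaging scores across a randomized quote would therefore overstate the value under exponential utility.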

Core claim

We introduce an exact discrete entropy-regularized Bellman operator that applies log-sum-exp regularization to deterministic-action certainty-equivalent scores, rather than to a risk-neutral one-step reward. This distinction is essential because the exponential certainty equivalent does not commute with quote randomization. For time step h and entropy parameter λ, we prove uniform convergence to the unregularized continuous-time risk-sensitive value at rate O(h + λ(1 + |log λ|)). We also prove certainty-equivalent performance bounds for the induced Gibbs policies under a fresh-sampling relaxed implementation, in which quote marks are sampled at potential fill events rather than frozen over a

What carries the argument

The exact discrete entropy-regularized Bellman operator that applies log-sum-exp regularization to the certainty-equivalent scores of deterministic quote actions.
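A minimal sketch of the log-sum-exp step on a finite grid of quote actions follows. It assumes the scores have already been computed as certainty equivalents of deterministic actions, uses the common convention λ·log Σ exp(score/λ), and makes no attempt to reproduce the paper's exact operator, normalization, or continuous quote space. The function names and the sample scores are hypothetical.

```python
import math

def soft_max(scores, lam):
    """lam * log(sum(exp(s / lam))), shifted by the maximum for numerical stability."""
    top = max(scores)
    return top + lam * math.log(sum(math.exp((s - top) / lam) for s in scores))

def gibbs_weights(scores, lam):
    """Gibbs (softmax) probabilities proportional to exp(s / lam)."""
    top = max(scores)
    raw = [math.exp((s - top) / lam) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical certainty-equivalent scores for three candidate quotes.
scores = [0.10, 0.25, 0.22]

for lam in (1.0, 0.1, 0.01):
    value = soft_max(scores, lam)
    weights = gibbs_weights(scores, lam)
    print(f"lam={lam:g} value={value:.4f} weights={[round(w, 3) for w in weights]}")
```

On a finite grid with n actions the regularized value lies between the best score and the best score plus λ log n, so it collapses to the hard maximum as λ shrinks while the Gibbs weights pile onto the maximizer.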

If this is right

  • The discrete values converge uniformly to the continuous-time risk-sensitive value at the stated rate.
  • Gibbs policies obtained from the operator satisfy explicit certainty-equivalent performance bounds under fresh sampling.
  • Under the quadratic growth condition the policies concentrate around the unregularized optimal quote set.
  • A cheaper Hamiltonian-Gibbs proxy achieves the same order of performance bound as the exact operator.
  • Numerical experiments in the Avellaneda-Stoikov model confirm the predicted scalings for error, bias, gap, concentration, and proxy consistency.
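To see how the two terms in the stated rate trade off, the sketch below evaluates the envelope h + λ(1 + |log λ|) for a few parameter pairs. The leading constant is set to one purely for illustration; the paper's constants depend on model primitives and are not reproduced here. The function name `rate_envelope` is hypothetical.

```python
import math

def rate_envelope(h, lam):
    """h + lam * (1 + |log lam|), with the unknown constant set to 1 (assumption)."""
    return h + lam * (1.0 + abs(math.log(lam)))

# Illustrative (h, lam) pairs; none of these values come from the paper.
for h, lam in [(1e-1, 1e-1), (1e-2, 1e-2), (1e-3, 1e-3), (1e-2, 1e-4)]:
    total = rate_envelope(h, lam)
    entropy_share = (total - h) / total
    print(f"h={h:g} lam={lam:g} envelope={total:.5f} entropy share={entropy_share:.2f}")
```

With λ equal to h the entropy term dominates because of the logarithmic factor; choosing λ well below h makes the discretization term the larger of the two.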

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same non-commutativity argument suggests that entropy regularization should be applied to certainty equivalents in other exponential-utility control problems beyond market making.
  • Fresh sampling of quotes at fill times may be worth testing in live high-frequency systems where freezing quotes for fixed intervals is costly.
  • The quadratic growth condition on the Hamiltonian points to a robustness property that could be checked in models with different order-arrival specifications.
  • The lower-cost proxy could serve as a practical starting point for online policy improvement when exact Bellman updates are too expensive.

Load-bearing premise

A quadratic growth condition on the Hamiltonian in the quote coordinates is required for the regularized policies to concentrate around the optimal quote set.
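One way to build intuition for this premise is a one-dimensional toy case. The sketch below assumes a purely quadratic score H(δ) = −(c/2)(δ − δ*)² on a quote grid, forms Gibbs weights proportional to exp(H/λ), and measures the mean squared distance to the maximizer. The curvature `c`, the grid, and the optimum `d_star` are made-up values, and the paper's actual Hamiltonian and growth condition are not reproduced.

```python
import math

def gibbs_mean_sq_error(lam, c=4.0, d_star=1.0, lo=0.0, hi=2.0, n=2001):
    """Mean squared distance to d_star under weights exp(H / lam) on a uniform grid."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    # Toy quadratic score (assumption): maximal at d_star with curvature c.
    weights = [math.exp(-0.5 * c * (d - d_star) ** 2 / lam) for d in grid]
    total = sum(weights)
    return sum(w * (d - d_star) ** 2 for w, d in zip(weights, grid)) / total

for lam in (0.04, 0.01, 0.0025):
    # Second column: grid estimate; third column: lam / c for comparison.
    print(lam, gibbs_mean_sq_error(lam), lam / 4.0)
```

In this toy case the Gibbs weights are a discretized Gaussian with variance λ/c, so the squared quote error shrinks linearly in λ. A growth condition that only bounds the score gap from below would be expected to give an upper bound of the same flavor rather than an equality.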

What would settle it

A controlled simulation would falsify the claimed convergence rate if, with λ held fixed and small enough that the entropy bias is negligible, the value gap between the discrete regularized operator and the continuous-time limit failed to shrink in proportion to h.
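The check described here amounts to fitting a slope on a log-log plot. The sketch below does that with ordinary least squares on synthetic placeholder errors generated as 0.7·h; these numbers are not results from the paper, and `fitted_order` is a hypothetical helper. In a real test the errors would come from comparing the discrete value with a reference solution, with λ small enough that the entropy bias does not mask the dependence on h.

```python
import math

def fitted_order(steps, errors):
    """Least-squares slope of log(error) against log(step)."""
    xs = [math.log(h) for h in steps]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

steps = [0.1, 0.05, 0.025, 0.0125]
synthetic_errors = [0.7 * h for h in steps]  # placeholder data, NOT from the paper
order = fitted_order(steps, synthetic_errors)
print(f"fitted order: {order:.3f}")
```

A fitted slope near one is consistent with first-order convergence in h. A slope well below one at small λ would be the kind of evidence this section has in mind.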

Figures

Figures reproduced from arXiv: 2605.24878 by Tenghan Zhong.

Figure 1: Value convergence. Left: first-order discretization error ...
Figure 2: Fresh-sampling policy performance gap. The certainty-equivalent gap ...
Figure 3: Quote convergence. Left: squared active-coordinate quote error. Right: integrated ...
Figure 4: Exact Bellman validation. Left: one-step consistency ...
Original abstract

We study a finite-inventory risk-sensitive market making problem in which a dealer controls bid and ask quotes, faces Brownian midprice risk, and receives liquidity-taking orders through point processes with quote-dependent intensities. The objective is the certainty equivalent induced by exponential utility with terminal and running inventory penalties. We introduce an exact discrete entropy-regularized Bellman operator that applies log-sum-exp regularization to deterministic-action certainty-equivalent scores, rather than to a risk-neutral one-step reward. This distinction is essential because the exponential certainty equivalent does not commute with quote randomization. For time step \(h\) and entropy parameter \(\lambda\), we prove uniform convergence to the unregularized continuous-time risk-sensitive value at rate \[ O\bigl(h+\lambda(1+|\log\lambda|)\bigr). \] We also prove certainty-equivalent performance bounds for the induced Gibbs policies under a fresh-sampling relaxed implementation, in which quote marks are sampled at potential fill events rather than frozen over a time step. Under a quadratic growth condition on the Hamiltonian in the relevant quote coordinates, these policies concentrate around the unregularized optimal quote set. Finally, we show that a lower-cost Hamiltonian-Gibbs proxy satisfies a certainty-equivalent performance bound of the same order as the exact Bellman Gibbs policy. Numerical experiments in an Avellaneda--Stoikov specification support the predicted scaling for discretization error, entropy bias, policy gap, quote concentration, and exact-versus-proxy consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies finite-inventory risk-sensitive market making under exponential utility with Brownian midprice risk and quote-dependent point-process intensities. It introduces an exact discrete entropy-regularized Bellman operator that regularizes deterministic-action certainty-equivalent scores (rather than risk-neutral rewards), proves uniform convergence of the discrete value to the continuous-time limit at rate O(h + λ(1 + |log λ|)), derives certainty-equivalent performance bounds for the induced Gibbs policies under a fresh-sampling implementation, and shows that these policies concentrate around the optimal quote set when the Hamiltonian satisfies a quadratic growth condition in the quote coordinates. A lower-cost Hamiltonian-Gibbs proxy is also analyzed, and numerical experiments in an Avellaneda-Stoikov specification are used to illustrate the predicted scalings for discretization error, entropy bias, policy gap, and quote concentration.

Significance. If the quadratic-growth assumption holds in the model, the work supplies a rigorous discrete-time approximation framework for risk-sensitive market making with explicit non-asymptotic rates and a clear separation between the entropy-regularized operator and the non-commuting certainty equivalent. The exact operator construction and the numerical confirmation of the scaling laws are concrete strengths. The concentration and performance-bound claims, however, rest on an external modeling assumption whose validity for the exponential-utility objective and intensity functions is not established in the manuscript.

major comments (2)
  1. [Abstract / concentration theorem] Abstract and the statement of the concentration theorem: the claim that the Gibbs policies concentrate around the unregularized optimal quote set is conditioned on a quadratic growth condition on the Hamiltonian in the relevant quote coordinates. No verification or proof is supplied that this condition is satisfied by the point-process intensities or the exponential-utility Hamiltonian in the Avellaneda-Stoikov specification (or in general), rendering the concentration result conditional on an unexamined modeling assumption.
  2. [Performance bounds / fresh-sampling implementation] Performance-bound section (fresh-sampling relaxed implementation): the certainty-equivalent bounds are derived under the fresh-sampling scheme in which quotes are re-sampled at potential fill events. It is not shown that the same bounds continue to hold, or degrade gracefully, when the more common fixed-quote implementation over each time step h is used instead; this implementation detail is load-bearing for the practical relevance of the bounds.
minor comments (2)
  1. [Convergence theorem] The entropy parameter λ and time step h are introduced in the abstract, but the convergence theorem does not state explicitly how the O(·) constants depend on model primitives (intensity bounds, volatility, inventory penalties).
  2. [Numerical experiments] The numerical experiments section would benefit from an explicit table or plot showing the measured exponent of the discretization error versus h and the entropy bias versus λ to allow direct comparison with the predicted O(h + λ(1+|log λ|)) rate.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thorough review and the positive comments on the exact discrete entropy-regularized Bellman operator, the convergence rates, and the numerical experiments. We address the two major comments below.

Point-by-point responses
  1. Referee: [Abstract / concentration theorem] Abstract and the statement of the concentration theorem: the claim that the Gibbs policies concentrate around the unregularized optimal quote set is conditioned on a quadratic growth condition on the Hamiltonian in the relevant quote coordinates. No verification or proof is supplied that this condition is satisfied by the point-process intensities or the exponential-utility Hamiltonian in the Avellaneda-Stoikov specification (or in general), rendering the concentration result conditional on an unexamined modeling assumption.

    Authors: The concentration theorem is stated explicitly under the quadratic growth condition on the Hamiltonian, a standard modeling assumption in concentration results of this kind. We agree that the manuscript gives no analytical verification for the specific intensities and utility in the Avellaneda-Stoikov model. In the revised manuscript, we will add a remark discussing the condition's plausibility for exponential intensities and bounded inventory, supported by the numerical evidence of quote concentration in Section 5. This addresses the concern without altering the conditional nature of the result. revision: partial

  2. Referee: [Performance bounds / fresh-sampling implementation] Performance-bound section (fresh-sampling relaxed implementation): the certainty-equivalent bounds are derived under the fresh-sampling scheme in which quotes are re-sampled at potential fill events. It is not shown that the same bounds continue to hold, or degrade gracefully, when the more common fixed-quote implementation over each time step h is used instead; this implementation detail is load-bearing for the practical relevance of the bounds.

    Authors: We acknowledge that the performance bounds are derived specifically for the fresh-sampling implementation, which facilitates the application of the entropy-regularized operator at event times. The fixed-quote implementation, while more common in practice, introduces additional approximation due to quote freezing over the interval h. In the revision, we will include a discussion of this distinction and argue that for small h the performance gap is controlled by the discretization error term O(h), though a full extension of the bounds to fixed quotes is beyond the current scope. This clarifies the practical relevance. revision: partial

standing simulated objections not resolved
  • Analytical proof that the quadratic growth condition holds for the Hamiltonian in the Avellaneda-Stoikov specification with exponential utility.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under external assumptions

Full rationale

The paper constructs its own discrete entropy-regularized Bellman operator from first principles, proves uniform convergence of the value function at the stated rate directly from that operator, and derives policy concentration and performance bounds under an explicitly stated quadratic growth assumption on the Hamiltonian. This assumption is introduced as a modeling hypothesis rather than derived or fitted within the paper, and the non-commutativity argument for certainty equivalents is independent of it. No self-citations, self-definitional loops, fitted inputs relabeled as predictions, or renamings of known results appear in the derivation chain. The claims reduce to standard analysis of the introduced operator plus the external assumption, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard stochastic-control assumptions plus one modeling condition whose necessity is not independently verified in the abstract.

free parameters (2)
  • entropy parameter λ
    Introduced to regularize the Bellman operator; its scaling with discretization step h is part of the convergence statement.
  • time step h
    Discretization parameter whose limit is taken in the convergence theorem.
axioms (1)
  • domain assumption: Quadratic growth condition on the Hamiltonian in the relevant quote coordinates
    Invoked to obtain policy concentration and performance bounds; location: paragraph stating the certainty-equivalent performance bounds for Gibbs policies.

pith-pipeline@v0.9.1-grok · 5790 in / 1459 out tokens · 23335 ms · 2026-06-29T23:17:50.988206+00:00 · methodology

