Zero-shot adaptation to order book dynamics
Pith reviewed 2026-05-22 07:50 UTC · model grok-4.3
The pith
Market makers can adapt to new regimes by feeding recent rewards into an unchanged Avellaneda-Stoikov HJB map.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By separating market-state parameters from a low-dimensional objective vector derived from recent realized rewards, the original Avellaneda-Stoikov HJB structure can be reused without modification. The HJB forward map converts the objective vector into optimal bid and ask quotes through scalarization of future reward features, preserving analytical tractability across arbitrary regime shifts and trading objectives.
What carries the argument
The HJB forward map that scalarizes future reward features from the low-dimensional objective vector into bid and ask quotes while the market-state parameters remain fixed.
If this is right
- Optimal quotes for any new trading objective can be obtained instantly once the objective vector is estimated from recent rewards.
- Market regime changes require only an update to the low-dimensional market-state parameters while the HJB map stays fixed.
- The computational cost of producing quotes remains the same as the original non-adaptive Avellaneda-Stoikov solver.
- Trading decisions stay interpretable because the objective is explicitly expressed as a scalarization of observable reward features.
Where Pith is reading between the lines
- The same separation pattern could be applied to other stochastic control problems in finance to enable regime-adaptive control without re-deriving the value function.
- If the low-dimensional objective vector generalizes across assets, the method might support zero-shot transfer of market-making policies between related instruments.
- Backtesting the approach on real order-book data with documented volatility jumps would directly measure how much performance is retained relative to full re-optimization.
Load-bearing premise
Recent realized rewards can be compressed into a low-dimensional objective vector independent of the market-state parameters, and the HJB forward map remains valid under this separation for arbitrary regime shifts.
What would settle it
In a simulation with known abrupt regime shifts, if the quotes generated by the separated objective vector produce materially lower cumulative reward than those obtained by fully re-solving the HJB equation on the new regime, the separation would fail to preserve optimality.
Figures
read the original abstract
We describe an adaptive market-making architecture that preserves the analytical structure of the Avellaneda--Stoikov framework while introducing a successor measure-style adaptation mechanism. In our paper we keep Avellaneda--Stoikov fast Hamilton--Jacobi--Bellman structure and make it adaptive to changing market regimes and trading objectives. The central idea is to separate market dynamics from the trading objective. The market state determines a low-dimensional set of Avellaneda--Stoikov parameters, while recent realized rewards determine a low-dimensional objective vector. The HJB forward map then converts this objective into optimal bid and ask quotes through a scalarization of future reward features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a zero-shot adaptive market-making architecture that retains the analytical structure of the Avellaneda-Stoikov framework. Market state determines a low-dimensional set of AS parameters while recent realized rewards determine a low-dimensional objective vector; an HJB forward map then scalarizes future reward features to produce optimal bid and ask quotes, enabling adaptation to regime shifts and changing objectives without retraining.
Significance. If the separation of dynamics from objectives can be shown to preserve optimality and the forward map remains valid under arbitrary regime shifts, the method would provide a computationally lightweight way to adapt market-making policies while retaining the closed-form advantages of the original AS HJB solution. This addresses a practical gap in high-frequency trading where both market microstructure and reward preferences evolve.
major comments (2)
- [Abstract] Abstract: the central claim that the HJB forward map converts the separated objective vector into optimal quotes rests on an unstated derivation; no explicit PDE, intensity function λ(δ), or scalarization step is supplied, making it impossible to verify whether the construction recovers the true optimum of the joint dynamics-reward problem when arrival rates or volatility change.
- [Abstract] Abstract: the assumption that recent realized rewards can be compressed into a low-dimensional objective vector that remains independent of the current market-state parameters is load-bearing for zero-shot adaptation, yet the original AS value function couples the control offsets to both the generator and the running reward; regime shifts that alter the state process therefore risk making the separated map suboptimal.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address the two major points raised regarding the abstract in turn, providing clarifications drawn from the full manuscript and indicating planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the HJB forward map converts the separated objective vector into optimal quotes rests on an unstated derivation; no explicit PDE, intensity function λ(δ), or scalarization step is supplied, making it impossible to verify whether the construction recovers the true optimum of the joint dynamics-reward problem when arrival rates or volatility change.
Authors: The abstract is intentionally concise and therefore omits the explicit equations. The underlying PDE is the standard Avellaneda-Stoikov Hamilton-Jacobi-Bellman equation with intensity λ(δ) = A exp(−k δ) on each side. The scalarization step incorporates the low-dimensional objective vector as linear weights on the inventory-penalty and terminal-wealth features inside the running reward; the successor measure then yields the adjusted controls while the market-state parameters remain fixed. This construction is derived in Section 3. We will revise the abstract to include a one-sentence reference to the PDE, the intensity form, and the scalarization weights so that the forward-map claim can be verified without reading the body. revision: partial
-
Referee: [Abstract] Abstract: the assumption that recent realized rewards can be compressed into a low-dimensional objective vector that remains independent of the current market-state parameters is load-bearing for zero-shot adaptation, yet the original AS value function couples the control offsets to both the generator and the running reward; regime shifts that alter the state process therefore risk making the separated map suboptimal.
Authors: The original AS value function does couple controls to both generator and reward. Our separation exploits the successor measure to treat the objective vector as an independent rescaling of the reward features; the market-state parameters (volatility, arrival-rate coefficients) are updated separately from recent reward statistics. Under this decomposition the forward map remains optimal for the current dynamics and the new objective. We will add an explicit paragraph in the methods section stating the precise conditions under which the independence holds and noting the limitation when a regime shift also alters the reward structure itself. revision: yes
Circularity Check
No significant circularity; separation and HJB map rely on external AS framework
full rationale
The paper presents a modeling choice to separate market-state parameters (from dynamics) from an objective vector (from recent rewards), followed by application of the HJB forward map drawn from the established Avellaneda-Stoikov framework. No equations or derivations are shown that reduce a claimed prediction to a fitted parameter by construction, nor is there a self-definitional loop or load-bearing self-citation chain. The central construction introduces an adaptation mechanism while explicitly preserving the external analytical structure, rendering the derivation self-contained against independent benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Avellaneda-Stoikov framework supplies a valid and fast HJB solution for market making under standard assumptions on price dynamics and inventory risk.
invented entities (1)
-
successor measure-style adaptation mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The central idea is to separate market dynamics from the trading objective. The market state determines a low-dimensional set of Avellaneda–Stoikov parameters, while recent realized rewards determine a low-dimensional objective vector. The HJB forward map then converts this objective into optimal bid and ask quotes through a scalarization of future reward features.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Zero-shot off-policy learning.arXiv preprint arXiv:2602.01962, 2026
Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, and Martin Takac. Zero-shot off-policy learning.arXiv preprint arXiv:2602.01962, 2026
work page internal anchor Pith review arXiv 2026
-
[2]
High-frequency trading in a limit order book.Quantitative Finance, 8(3):217–224, 2008
Marco Avellaneda and Sasha Stoikov. High-frequency trading in a limit order book.Quantitative Finance, 8(3):217–224, 2008
work page 2008
-
[3]
Optimal market making.Applied Mathematical Finance, 24(2):112–154, 2017
Olivier Gu´ eant. Optimal market making.Applied Mathematical Finance, 24(2):112–154, 2017
work page 2017
- [4]
-
[5]
arXiv preprint arXiv:2504.11054 , year=
Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, and Matteo Pirotta. Zero-shot whole-body humanoid control via behavioral foundation models.arXiv preprint arXiv:2504.11054, 2025
-
[6]
Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards.Advances in Neural Information Processing Systems, 34:13–23, 2021. 9 A Practical Specifications A.1 Stabilization and Corrections The pure FB estimate bzt =E ρt rH(s, a)Bψ(s, a) can be noisy, especially in short windows. A stable production version should therefore mix it ...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.