Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression

Shangzhe Li; Wenwu Yu; Xinming Gao; Yujin Cai

arxiv: 2511.11973 · v2 · submitted 2025-11-15 · 💻 cs.LG


Pith reviewed 2026-05-17 22:17 UTC · model grok-4.3


keywords offline reinforcement learning · quantile regression · extreme Q-learning · value regularization · hyperparameter consistency · stable training · D4RL benchmark


The pith

Quantile regression for the temperature coefficient makes offline extreme Q-learning stable with fixed hyperparameters across datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the instability and heavy tuning needs of Extreme Q-Learning in offline reinforcement learning. It does this by estimating the key temperature coefficient using quantile regression based on mild assumptions about the error distribution. Adding a simple value regularization step further steadies the training process. As a result, the new Quantile Q-Learning method performs well on standard test suites while using the same settings for every task and domain.

Core claim

The authors propose Quantile Q-Learning, which revisits Extreme Q-Learning by using quantile regression to estimate the temperature coefficient under mild assumptions and introduces value regularization with mild generalization to improve training stability, leading to competitive or superior performance on D4RL and NeoRL2 benchmarks with consistent hyperparameters.

What carries the argument

Quantile regression estimator for the temperature coefficient in the extreme value modeling of Bellman errors, together with an added value regularization technique.
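As a generic illustration of the quantile regression ingredient, here is a minimal sketch of the standard pinball loss, minimized over a single scalar by subgradient descent. It is not the paper's estimator: the quantile level 0.9, the learning rate, the Gaussian stand-in data, and the names `pinball_loss` and `fit_constant_quantile` are all assumptions made for illustration.

```python
import random

def pinball_loss(pred, targets, tau):
    """Mean pinball (quantile) loss of a constant prediction at level tau."""
    total = 0.0
    for y in targets:
        diff = y - pred
        # Under-predictions are weighted by tau, over-predictions by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(targets)

def fit_constant_quantile(targets, tau, lr=0.1, steps=2000):
    """Subgradient descent on the mean pinball loss for one scalar parameter."""
    q = sum(targets) / len(targets)  # start from the sample mean
    n = len(targets)
    for _ in range(steps):
        below = sum(1 for y in targets if y <= q)
        grad = below / n - tau  # subgradient: empirical CDF at q minus tau
        q -= lr * grad
    return q

rng = random.Random(0)
errors = [rng.gauss(0.0, 1.0) for _ in range(500)]  # hypothetical stand-in data
q90 = fit_constant_quantile(errors, tau=0.9)
empirical_q90 = sorted(errors)[int(0.9 * len(errors))]
```

The subgradient vanishes where the empirical CDF equals `tau`, so the fitted scalar should land close to the empirical quantile. A neural estimator would replace the scalar with a network output, and how the paper does that is not specified here.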

Load-bearing premise

Quantile regression accurately estimates the temperature coefficient when the Bellman errors follow the assumed mild conditions, and the value regularization works without major loss of performance.
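One way to see why quantiles can pin down a temperature is the Gumbel quantile function. If the errors follow a Gumbel (max) distribution with location μ and scale β, then x_p = μ − β ln(−ln p), so the gap between two quantiles is proportional to β. The sketch below recovers β from two empirical quantiles of synthetic data. Treating β as the Gumbel scale, the choice of the 0.25 and 0.75 levels, and the helper names are assumptions; the paper's actual procedure may differ.

```python
import math
import random

def sample_gumbel(mu, beta, n, rng):
    """Draw n samples from a Gumbel (max) distribution by inverse-CDF sampling."""
    samples = []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        samples.append(mu - beta * math.log(-math.log(u)))
    return samples

def empirical_quantile(xs, p):
    """Order-statistic estimate of the p-th quantile."""
    ordered = sorted(xs)
    return ordered[min(int(p * len(ordered)), len(ordered) - 1)]

def beta_from_quantiles(xs, p_lo=0.25, p_hi=0.75):
    """Scale estimate from x_hi - x_lo = beta * (ln(-ln p_lo) - ln(-ln p_hi))."""
    spread = empirical_quantile(xs, p_hi) - empirical_quantile(xs, p_lo)
    return spread / (math.log(-math.log(p_lo)) - math.log(-math.log(p_hi)))

rng = random.Random(0)
data = sample_gumbel(mu=1.0, beta=2.0, n=20000, rng=rng)  # synthetic, not benchmark data
beta_hat = beta_from_quantiles(data)
```

The estimate is only as good as the Gumbel assumption: if the tails of the real Bellman errors deviate from it, the two-quantile ratio no longer identifies a single scale, which is the failure mode the premise above depends on.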

What would settle it

Applying the method to a dataset where the Bellman error tails do not match the mild assumptions for quantile regression and finding that training becomes unstable or requires new hyperparameter choices.

Figures

Figures reproduced from arXiv: 2511.11973 by Shangzhe Li, Wenwu Yu, Xinming Gao, Yujin Cai.

**Figure 2.** QQL Performance and Ablation Analysis. We visualize the following results: (a) Training stability comparison between QQL and MXQL. (b) Q-value plots demonstrating the stabilization effect of value regularization and conservative estimation.

| Domain | Gym Locomotion | Adroit | AntMaze |
|---|---|---|---|
| XQL (Dataset-specific tuning) | 83.7±2.7 | 40.6±6.3 | 83.8±6.9 |
| XQL (Consistent per domain) | 75.5±3.6 | 38.6±4.5 | 69.6±10.7 |
| XQL (Consistent ac… | | | |

**Figure 3.** Ablation Studies on Value Regularization and Conservative Estimation. Comparison of QQL performance to its variant without value regularization (w/o VR) and its variant without conservative estimation (w/o CE) adapted in the methodology section. The hyperparameters are consistent across variants. These results demonstrate that both VR and CE are essential for effective offline RL. Their complementary effects yield …

**Figure 4.** Toy Example for β(s) Scale. Empirical distributions of −Q(s, a), along with the corresponding fitted Gumbel distributions and their estimated location, scale, and goodness-of-fit p-values.

5.5 Experiments on β(s) Scale

In Definition 1, we actually assume that the β(s) scale of Assumption 1 and Assumption 2 is similar, as discussed in the remark of Definition 1. In this section, we provide empirical evidenc…
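The kind of fit reported in Figure 4 can be sketched with a method-of-moments Gumbel fit plus a Kolmogorov-Smirnov distance. This is a rough stand-in under stated assumptions: it uses synthetic Gumbel (max) samples rather than −Q(s, a) values, it reports only the KS statistic and not a p-value, and the function names are hypothetical. The paper's fitting procedure is not described in this summary.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def fit_gumbel_moments(xs):
    """Method of moments: variance = (pi * scale)^2 / 6, mean = loc + gamma * scale."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    scale = math.sqrt(6.0 * var) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale

def gumbel_cdf(x, loc, scale):
    """CDF of the Gumbel (max) distribution, guarded against exp overflow."""
    z = -(x - loc) / scale
    if z > 700.0:
        return 0.0
    return math.exp(-math.exp(z))

def ks_statistic(xs, loc, scale):
    """Largest gap between the empirical CDF and the fitted Gumbel CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        f = gumbel_cdf(x, loc, scale)
        worst = max(worst, (i + 1) / n - f, f - i / n)
    return worst

rng = random.Random(1)
samples = []
for _ in range(20000):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    samples.append(1.0 - 2.0 * math.log(-math.log(u)))  # synthetic Gumbel(loc=1, scale=2)

loc_hat, scale_hat = fit_gumbel_moments(samples)
ks = ks_statistic(samples, loc_hat, scale_hat)
```

A small KS distance on real data would support the Gumbel assumption for that state; a large one would be the kind of evidence the "what would settle it" test above asks for.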

Abstract

Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and both exhibit instability during training. To address these issues, we propose a principled method to estimate the temperature coefficient $\beta$ via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.
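To make the role of β concrete, here is a minimal sketch assuming the Gumbel regression objective commonly associated with XQL, the mean of exp(z) − z − 1 with z = (Q − V)/β. The abstract does not state the loss, so this form is an assumption, as are the Q-values and the function names. Under that objective the minimizing V is the soft maximum β log mean exp(Q/β), which slides from the mean of Q toward the maximum of Q as β shrinks.

```python
import math

def gumbel_loss(v, qs, beta):
    """Mean of exp(z) - z - 1 with z = (q - v) / beta (assumed XQL-style loss)."""
    total = 0.0
    for q in qs:
        z = (q - v) / beta
        # math.exp raises OverflowError once z exceeds roughly 709, which is
        # one way a small beta can destabilize a naive implementation.
        total += math.exp(z) - z - 1.0
    return total / len(qs)

def soft_value(qs, beta):
    """Closed-form minimizer beta * log(mean(exp(q / beta))), computed stably."""
    m = max(qs)
    mean_exp = sum(math.exp((q - m) / beta) for q in qs) / len(qs)
    return m + beta * math.log(mean_exp)

qs = [0.0, 1.0, 2.0, 5.0]          # hypothetical Q-values for one state
v_sharp = soft_value(qs, 0.01)     # small beta: close to max(qs)
v_mid = soft_value(qs, 1.0)
v_flat = soft_value(qs, 100.0)     # large beta: close to the mean of qs
```

Because the target value moves this much with β, and the exponential grows quickly for small β, a per-dataset search over β is plausible in practice; that is the tuning burden the abstract says the quantile-based estimate is meant to remove.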

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper automates beta estimation in Extreme Q-Learning with quantile regression plus value regularization, but the stability and no-tuning claims rest on assumptions that still need checking.


The main thing to know is that they replace the per-dataset beta tuning in XQL and MXQL with quantile regression to estimate it directly from data, then add a value regularization term to reduce training instability. That combination is the actual new piece on top of the existing Extreme Q-Learning line. It does a solid job naming the real usability problems (tuning burden and unstable runs) that keep these methods from being straightforward in high-stakes offline settings. If the quantile step works under the stated mild assumptions, it could cut down on the hyperparameter search that currently makes XQL less practical for costly domains. The experiments claim competitive or better results on D4RL and NeoRL2 with one fixed hyperparameter set and steadier training, which would be useful if it holds.

The soft spots are around verification. The abstract leans on mild assumptions for both the beta estimation and the regularization's generalization, yet without the exact quantile regression procedure, how it interacts with the function approximators, or fuller experimental controls, it's difficult to tell whether the reported gains and hyperparameter consistency are robust or partly dataset-specific. The stress-test concern about those assumptions is fair; if they don't hold for the neural nets on these benchmarks, the advantages could shrink.

This is aimed at offline RL researchers and practitioners who care about reducing tuning sensitivity rather than a broad audience. It shows clear engagement with the prior XQL limitations, so it deserves a serious referee to examine the derivations and results in detail. I'd send it to review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Quantile Q-Learning as an extension of Extreme Q-Learning for offline RL. It estimates the temperature coefficient β via quantile regression under mild assumptions to remove per-dataset hyperparameter tuning required by XQL/MXQL, and adds a value regularization term with mild generalization to stabilize training. The central empirical claim is that the resulting algorithm achieves competitive or superior performance on D4RL and NeoRL2 benchmarks while exhibiting stable dynamics and using one fixed hyperparameter set across all datasets and domains.

Significance. If the assumptions on β estimation and value regularization hold and the reported gains are reproducible, the work would meaningfully improve the practicality of offline RL methods that rely on extreme value modeling by reducing tuning burden and instability. The data-driven estimation of β via quantile regression is a principled step that could be adopted more broadly.

major comments (2)

[β estimation via quantile regression] The section describing β estimation via quantile regression: the claim that this procedure operates under 'mild assumptions' and yields a tuning-free β is load-bearing for the consistent-hyperparameter claim, yet the exact quantile level, regression objective, any implicit regularization, and empirical verification that the estimate remains stable across D4RL/NeoRL2 domains are not specified. Without these details the support for replacing XQL/MXQL tuning cannot be assessed.
[value regularization technique] The section introducing the value regularization technique: the regularization is asserted to stabilize training under 'mild generalization,' but the precise mathematical form of the term, its placement in the Bellman update or loss, and any ablation isolating its effect on the reported stability are absent. This directly affects the stability and hyperparameter-consistency claims.

minor comments (2)

[Abstract] Abstract: the repeated use of 'mild assumptions' and 'mild generalization' without a brief qualifier or section pointer reduces immediate clarity for readers.
[Experiments] Experimental results: tables or figures should include standard errors or statistical tests to substantiate statements of 'competitive or superior' performance across the full set of tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for major revision. We address each major comment below with clarifications and commit to revisions that improve transparency without altering the core contributions.


Referee: [β estimation via quantile regression] The section describing β estimation via quantile regression: the claim that this procedure operates under 'mild assumptions' and yields a tuning-free β is load-bearing for the consistent-hyperparameter claim, yet the exact quantile level, regression objective, any implicit regularization, and empirical verification that the estimate remains stable across D4RL/NeoRL2 domains are not specified. Without these details the support for replacing XQL/MXQL tuning cannot be assessed.

Authors: We agree that greater specificity is needed to fully support the tuning-free claim. In the revision we will explicitly state the quantile level, the precise regression objective, confirm the lack of additional implicit regularization, and add empirical verification (including plots) demonstrating stability of the β estimates across D4RL and NeoRL2 domains. These details will be highlighted in Section 3.2 and the appendix.
revision: yes
Referee: [value regularization technique] The section introducing the value regularization technique: the regularization is asserted to stabilize training under 'mild generalization,' but the precise mathematical form of the term, its placement in the Bellman update or loss, and any ablation isolating its effect on the reported stability are absent. This directly affects the stability and hyperparameter-consistency claims.

Authors: We appreciate this observation. The revised manuscript will include the exact mathematical form of the value regularization term, specify its placement within the loss and Bellman update, and add an ablation study isolating its effect on training stability. These changes will be made in Section 3.3 and the supplementary material to directly bolster the stability claims.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; method introduces independent estimation and regularization steps validated on external benchmarks


The paper defines a new procedure for estimating the temperature coefficient β via quantile regression under stated mild assumptions and adds a value regularization term for stability. These components are presented as solutions to the hyperparameter sensitivity and instability of prior XQL/MXQL methods. Performance claims rest on empirical results across external benchmark suites (D4RL and NeoRL2) rather than any reduction of outputs to the estimation procedure by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing; the derivation chain remains self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two mild assumptions stated in the abstract: one enabling quantile regression for β and one supporting the generalization of the value regularization. No explicit free parameters are introduced because the method uses a single consistent hyperparameter set; no new entities are postulated.

axioms (2)

domain assumption: Mild assumptions suffice for estimating the temperature coefficient β via quantile regression.
Invoked in the abstract as the basis for the principled estimation method.
domain assumption: The value regularization technique possesses mild generalization properties.
Stated in the abstract as justification for the stability improvement.


