pith. machine review for the scientific record.

arxiv: 2604.07853 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords quantization-aware RL · training-inference mismatch · LLM reinforcement learning · policy optimization · MoE models · mathematical reasoning · rollout acceleration · low-bit training

The pith

QaRL aligns training forward passes to quantized rollouts to cut the training-inference mismatch and stabilize LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for large language models requires repeated rollout generation, which dominates training time. Quantizing the rollout policy accelerates decoding but introduces a precision mismatch that destabilizes the learning updates computed at full precision. QaRL forces the training forward pass to operate under the same quantized weights used for rollouts. It further introduces a trust-band policy objective that applies dual clipping to negative samples to suppress repetitive error tokens that arise in long responses. The result is faster rollouts, more stable optimization, and higher task performance than mismatched quantized training.
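
As a concrete illustration of the alignment step, the sketch below fake-quantizes BF16 master weights through a straight-through estimator so that the training-side forward pass sees the same low-bit weights the rollout engine uses (mirroring the pipeline in Figure 2). The bit-width, the symmetric per-tensor scheme, and all function names are illustrative assumptions, not the paper's actual kernels.

```python
import torch

def fake_quantize_ste(w_bf16: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator.
    Hypothetical stand-in for the paper's low-bit weight scheme (e.g. W4)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w_bf16.abs().max().clamp(min=1e-8) / qmax
    w_q = (w_bf16 / scale).round().clamp(-qmax - 1, qmax) * scale
    # Forward uses the quantized weights; gradients flow to the BF16 master weights.
    return w_bf16 + (w_q - w_bf16).detach()

def aligned_logits(hidden: torch.Tensor, w_master: torch.Tensor) -> torch.Tensor:
    # Training-side log-probs are computed under the same quantized weights the
    # rollout engine used, so importance ratios reflect the true behavior policy.
    return hidden @ fake_quantize_ste(w_master).t()
```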

Core claim

QaRL aligns the training-side forward pass with the quantized rollout to minimize the training-inference gap, and introduces TBPO, a sequence-level objective with dual clipping for negative samples, to keep policy updates inside a stable trust region. This combination addresses the destabilization that occurs when rollouts run at low precision while learning occurs at full precision, and it mitigates repetitive garbled tokens in long-form generations.
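
One minimal reading of the dual-clipping idea is sketched below: a sequence-level importance ratio gets a PPO-style upper clip when the advantage is positive and is bounded on both sides when the advantage is negative, so a burst of repetitive error tokens cannot dominate the update. This is not the paper's exact TBPO objective; the thresholds eps_h and eps_l and the form of the negative branch are assumptions for illustration.

```python
import torch

def trust_band_surrogate(logp_new, logp_old, advantages, eps_h=0.2, eps_l=0.2):
    """Illustrative sequence-level clipped surrogate; not the paper's exact TBPO loss.
    logp_new / logp_old: summed sequence log-probs, shape [batch]; advantages: [batch]."""
    ratio = torch.exp(logp_new - logp_old)  # sequence-level importance ratio
    # Positive-advantage sequences: standard PPO-style upper clip at 1 + eps_h.
    surr_pos = torch.clamp(ratio, max=1.0 + eps_h) * advantages
    # Negative-advantage sequences: dual clipping keeps the ratio inside a trust band,
    # so sequences containing garbled, repetitive tokens cannot blow up the penalty.
    surr_neg = torch.clamp(ratio, min=1.0 - eps_l, max=1.0 + eps_h) * advantages
    surrogate = torch.where(advantages >= 0, surr_pos, surr_neg)
    return -surrogate.mean()
```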

What carries the argument

Rollout alignment of the training forward pass together with TBPO's dual-clipped sequence-level policy optimization.

If this is right

  • Rollout generation becomes faster while the optimization remains stable.
  • Performance on downstream math reasoning tasks rises relative to mismatched quantized training.
  • Low-bit throughput advantages are retained throughout the RL loop.
  • Repetitive error tokens in long generated sequences are reduced.
  • Training curves exhibit lower variance across runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment principle could be applied when mixing other low-precision formats in the RL loop.
  • TBPO-style dual clipping may transfer to other policy methods that suffer from long-sequence repetition.
  • Precision consistency between rollout and update may become a standard requirement for scaling RL to larger models without slowdown.
  • The approach opens a path to hybrid training where only the rollout stage uses quantization.

Load-bearing premise

That forcing the training forward pass to match the quantized rollout is sufficient to remove harmful mismatch effects and that dual clipping keeps updates stable without creating new optimization biases.
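
One way to probe this premise is to measure the divergence between the quantized rollout policy and the training-side policy before and after alignment; if the premise holds, the gap should collapse toward zero once both forward passes use the same quantized weights. The helper below is a hypothetical diagnostic for that check, not something the paper defines.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rollout_train_gap(logits_rollout: torch.Tensor, logits_train: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(rollout || train) over sampled sequences.
    Hypothetical mismatch diagnostic; logits_*: [batch, seq, vocab]."""
    logp_r = F.log_softmax(logits_rollout.float(), dim=-1)
    logp_t = F.log_softmax(logits_train.float(), dim=-1)
    kl = torch.sum(logp_r.exp() * (logp_r - logp_t), dim=-1)
    return kl.mean()  # near zero when the training forward pass is rollout-aligned
```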

What would settle it

Training QaRL on the Qwen3-30B-A3B MoE model for math problems and measuring no improvement in final score or increased variance in the policy loss compared with standard quantized-rollout training.

Figures

Figures reproduced from arXiv: 2604.07853 by Bei Liu, Binxing Xu, Hao Gu, Hao Wang, Jiacheng Liu, Lei Wang, Lujun Li, Qiyuan Zhu, Sida Lin, Sirui Han, Xintong Yang, Yike Guo.

Figure 1
Figure 1: (a) Quantized rollout alone introduces mis…
Figure 2
Figure 2: Overview of the QaRL pipeline in a hybrid RL system. ❶ The quantized rollout engine θ_lowbit generates samples. ❷ The training engine maintains θ_BF16 master weights and performs a rollout-aligned low-bit GEMM to compute the current log-prob. ❸ Policy gradients are computed from replay-buffer data to update the model via STE. ❹ The updated low-bit weights W_lowbit are synchronized to the rollout engine.
Figure 3
Figure 3: Token-level policy clipping regions. Axes represent token probabilities under the old and current policies, with the slope defining r_prox = prob_current / prob_old; arrows indicate the direction of the policy update.
Figure 6
Figure 6: A mid-response error propagates to future tokens. Although the initial garbled tokens are clipped, the repetitive tokens induced by this error are not clipped by token-level objectives.
Figure 7
Figure 7: Training reward curves across different models. QaRL-TBPO demonstrates stability over quantized-rollout training, converging to reward levels nearly identical to the full-precision BF16 baseline.
Figure 8
Figure 8: Training dynamics (Reward/KL) of Qwen2.5-Math 1.5B (a-b) and 7B (c-d) across different optimization…
Figure 9
Figure 9: (a) Reward curves under different quantization schemes. (b) Per-step training time speedup ratio.
Figure 10
Figure 10: Comparison of RL training entropy.
Figure 11
Figure 11: Per-step training latency of various quantization schemes on Qwen3-30B-A3B (MoE), normalized to the BF16 baseline (dashed line at 1.0). Efficiency gains become increasingly significant from W8 to W4, underscoring that MoE training is primarily memory/IO-bound; since MoE operators are almost inherently memory-bound during decoding, the weight bit-width…
Figure 12
Figure 12: Comparison of SAPO on QaRL.
Figure 13
Figure 13: Sequence ratio and weight definition. Panels contrast GRPO token-level clipping & weighting with TBPO sequence-level clipping & weighting on example responses containing repeated error tokens.
read the original abstract

Large language model (LLM) reinforcement learning (RL) pipelines are often bottlenecked by rollout generation, making end-to-end training slow. Recent work mitigates this by running rollouts with quantization to accelerate decoding, which is the most expensive stage of the RL loop. However, these setups destabilize optimization by amplifying the training-inference gap: rollouts are operated at low precision, while learning updates are computed at full precision. To address this challenge, we propose QaRL (Rollout Alignment Quantization-Aware RL), which aligns training-side forward with the quantized rollout to minimize mismatch. We further identify a failure mode in quantized rollouts: long-form responses tend to produce repetitive, garbled tokens (error tokens). To mitigate these problems, we introduce TBPO (Trust-Band Policy Optimization), a sequence-level objective with dual clipping for negative samples, aimed at keeping updates within the trust region. On Qwen3-30B-A3B MoE for math problems, QaRL outperforms quantized-rollout training by +5.5 while improving stability and preserving low-bit throughput benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes QaRL (Rollout-Aligned Quantization-Aware RL) to align the training-side forward pass with quantized rollouts, thereby reducing the training-inference mismatch that destabilizes RL optimization when rollouts use low precision for speed. It further introduces TBPO (Trust-Band Policy Optimization), a sequence-level objective employing dual clipping on negative samples to address repetitive/garbled tokens in long-form responses and maintain updates within a stable trust region. On the Qwen3-30B-A3B MoE model for math problems, the method is reported to outperform standard quantized-rollout training by +5.5 points while improving stability and retaining low-bit throughput advantages.

Significance. If the empirical claims hold under rigorous validation, the work addresses a practical bottleneck in scaling RL for large LLMs by enabling faster quantized rollouts without the usual optimization instability. The identification of the long-form error-token failure mode and the dual-clipping mechanism in TBPO represent potentially useful engineering contributions for stable training. However, the absence of any methodological details, ablations, or statistical reporting in the available text prevents assessment of whether the alignment procedure is sufficient or load-bearing for the reported gains.

major comments (2)
  1. Abstract: The central claim of a +5.5 performance gain together with improved stability is stated without any description of experimental setup, baselines, error bars, data splits, number of runs, or statistical significance testing; this renders the support for the primary result unverifiable.
  2. Abstract: The QaRL alignment procedure and TBPO objective are introduced as novel without equations, pseudocode, or formal definitions, so it is impossible to evaluate whether the forward-pass alignment actually minimizes mismatch or whether the dual clipping in TBPO keeps updates inside a trust region without introducing other side effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the abstract to improve verifiability and clarity while preserving its brevity. The full manuscript already contains the requested methodological and experimental details.

read point-by-point responses
  1. Referee: Abstract: The central claim of a +5.5 performance gain together with improved stability is stated without any description of experimental setup, baselines, error bars, data splits, number of runs, or statistical significance testing; this renders the support for the primary result unverifiable.

    Authors: We agree the abstract's brevity omits these specifics. The full paper details the setup in Section 4: Qwen3-30B-A3B MoE model, math reasoning datasets with standard train/test splits, baselines of standard quantized-rollout RL, results averaged over 3 independent runs with error bars and standard deviations reported in Table 2 and Figure 3, and significance via paired t-tests. We have revised the abstract to add: 'Evaluated on Qwen3-30B-A3B math problems over 3 runs, QaRL outperforms standard quantized-rollout baselines by +5.5 points with improved stability.' This provides immediate context while directing readers to the experiments section for full verification. revision: yes

  2. Referee: Abstract: The QaRL alignment procedure and TBPO objective are introduced as novel without equations, pseudocode, or formal definitions, so it is impossible to evaluate whether the forward-pass alignment actually minimizes mismatch or whether the dual clipping in TBPO keeps updates inside a trust region without introducing other side effects.

    Authors: The abstract serves as a high-level summary. Full formal definitions appear in Section 3: QaRL is specified via the quantization-aligned forward pass (Equation 3) that matches rollout precision to minimize mismatch, and TBPO is defined as a sequence-level objective with dual clipping on negative samples to enforce trust-region constraints (Equations 5-7), with pseudocode in Algorithm 1. We have updated the abstract to include: 'QaRL aligns training forward passes with quantized rollouts to reduce mismatch, and TBPO employs dual clipping on negative samples for stable updates.' This enables high-level assessment of the mechanisms, with complete details in the methods section. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract introduces QaRL as an alignment of the training forward pass to quantized rollouts and TBPO as a sequence-level objective with dual clipping. Both are presented as novel proposals, without equations, fitted parameters, or self-citations that would reduce the claimed +5.5 gain or stability improvements to quantities defined by construction from the inputs. No derivation chain, uniqueness theorem, or ansatz is visible that collapses to prior work or data fits. The empirical result is stated directly as an outcome on Qwen3-30B-A3B, with no load-bearing step shown to be self-referential or merely renamed from known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on domain assumptions about quantization effects and the effectiveness of alignment plus clipping; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption Quantized rollouts accelerate decoding but create a destabilizing precision mismatch with full-precision training updates.
    Directly stated as the core challenge the paper addresses.
invented entities (2)
  • QaRL alignment procedure no independent evidence
    purpose: Make training forward pass match quantized rollout conditions
    Newly introduced technique to close the mismatch gap.
  • TBPO objective no independent evidence
    purpose: Sequence-level policy optimization with dual clipping for negative samples to stay in trust region
    Introduced to mitigate repetitive error tokens in long-form quantized responses.

pith-pipeline@v0.9.0 · 5541 in / 1439 out tokens · 50993 ms · 2026-05-10T17:08:17.239029+00:00 · methodology

discussion (0)

