Recognition: no theorem link
AIS: Adaptive Importance Sampling for Quantized RL
Pith reviewed 2026-05-15 03:09 UTC · model grok-4.3
The pith
Adaptive Importance Sampling corrects non-stationary bias from low-precision rollouts while keeping their speed gains in LLM RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The rollout-training mismatch is non-stationary and double-edged: early in training it supplies a stochastic exploration bonus, but later it injects bias that risks collapse; AIS resolves this by forming a per-batch mixing coefficient from weight reliability, divergence severity, and variance amplification, which smoothly interpolates between the uncorrected and importance-weighted gradients.
What carries the argument
The per-batch mixing coefficient, built from three real-time diagnostics (weight reliability, divergence severity, and variance amplification), which controls the interpolation strength between the uncorrected and importance-weighted gradients.
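The page does not reproduce the paper's combination rule, so the sketch below is only a guess at the shape such a mechanism could take: three scalar diagnostics computed from per-batch log-probability ratios, each mapped to [0, 1] and multiplied into one mixing coefficient. Every name, threshold, and the product rule itself are assumptions for illustration, not the paper's formulas.

```python
# Hypothetical sketch (not the paper's method): build a per-batch mixing
# coefficient alpha in [0, 1] from three diagnostics computed on log-prob
# ratios between the trainer (BF16) and rollout (e.g. FP8) policies.
import numpy as np

def mixing_coefficient(logp_train, logp_rollout, advantages, clip=5.0):
    """alpha = 0 keeps the uncorrected gradient; alpha = 1 applies full importance weighting."""
    log_ratio = logp_train - logp_rollout              # per-sequence log importance weights
    w = np.exp(np.clip(log_ratio, -clip, clip))        # clipped importance weights

    # Diagnostic 1: weight reliability via normalized effective sample size (ESS / N).
    ess = w.sum() ** 2 / (np.square(w).sum() + 1e-8)
    weight_reliability = ess / len(w)                  # near 1 means uniform, trustworthy weights

    # Diagnostic 2: divergence severity, proxied by the mean absolute log-ratio.
    divergence_severity = float(np.mean(np.abs(log_ratio)))

    # Diagnostic 3: variance amplification of weighted vs. unweighted advantage terms.
    variance_amplification = float(np.var(w * advantages) / (np.var(advantages) + 1e-8))

    # Map each diagnostic to [0, 1] and combine; scales and thresholds are invented for the sketch.
    need_correction = 1.0 - np.exp(-divergence_severity / 0.1)              # more divergence -> correct more
    safe_to_correct = weight_reliability                                    # unreliable weights -> back off
    variance_guard = 1.0 / (1.0 + max(variance_amplification - 1.0, 0.0))   # damp if weighting inflates variance

    return float(np.clip(need_correction * safe_to_correct * variance_guard, 0.0, 1.0))
```

Under this toy rule alpha stays near zero early on (small divergence), grows as the mismatch widens, and falls back if the importance weights themselves become too unreliable or variance-amplifying to trust, which is the qualitative behavior the core claim ascribes to AIS.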
If this is right
- FP8 rollouts deliver a 1.5–2.76× speedup without performance loss on most mathematical and planning tasks.
- Training remains stable on both diffusion-based and autoregressive models when AIS is integrated into GRPO.
- The same per-batch correction mechanism prevents outright collapse that otherwise occurs with uncorrected low-precision rollouts.
Where Pith is reading between the lines
- The diagnostic combination could be reused for other quantization schemes or sources of non-stationary bias in policy gradients.
- If the mixing coefficient generalizes, it would reduce the need for full-precision rollouts in large-scale LLM RL pipelines.
- Monitoring the three diagnostics separately might reveal new ways to detect when exploration has turned into harmful bias.
Load-bearing premise
The three real-time diagnostics can be combined into one mixing coefficient that reliably preserves early exploration benefits while suppressing later destabilizing bias across different models, tasks, and training stages without introducing new instabilities.
What would settle it
A training run on a held-out reasoning benchmark where the policy still collapses under FP8 rollouts even after AIS applies its mixing coefficient.
Original abstract
Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.
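To make the abstract's mechanism concrete, here is a minimal sketch, under assumed names and a simplified sequence-level loss, of how a mixing coefficient could blend an uncorrected GRPO-style surrogate with an importance-weighted one. The group-normalized advantages and the trainer/rollout log-prob ratio are standard ingredients; the exact loss AIS optimizes is not reproduced on this page.

```python
# Minimal, assumed sketch of AIS-style interpolation inside a GRPO-like update.
# Not the paper's implementation; sequence-level log-probs are used for brevity.
import torch

def ais_policy_loss(logp_train, logp_rollout, rewards, alpha, clip=5.0):
    """
    logp_train:   (G,) trainer-policy (BF16) log-probs of G sampled completions, requires grad
    logp_rollout: (G,) rollout-engine (e.g. FP8) log-probs of the same completions
    rewards:      (G,) scalar rewards for the group of completions
    alpha:        mixing coefficient in [0, 1]; 0 = uncorrected, 1 = fully importance-weighted
    """
    # Group-normalized advantages, as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance weights between trainer and rollout distributions (clipped, no gradient).
    w = (logp_train - logp_rollout).clamp(-clip, clip).exp().detach()

    # REINFORCE-style surrogates; gradients flow only through logp_train.
    loss_uncorrected = -(adv * logp_train).mean()
    loss_weighted = -(w * adv * logp_train).mean()

    # AIS-style interpolation between the two gradient estimators.
    return (1.0 - alpha) * loss_uncorrected + alpha * loss_weighted
```

With alpha supplied per batch by diagnostics of the kind sketched earlier, the same code path covers both the uncorrected and the fully corrected regimes.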
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Adaptive Importance Sampling (AIS) to address the non-stationary rollout-training mismatch in quantized RL for LLMs, where FP8 rollouts paired with BF16 trainers introduce bias that aids early exploration but risks later collapse. AIS computes a per-batch mixing coefficient from three real-time diagnostics (weight reliability, divergence severity, variance amplification) to interpolate between uncorrected and importance-weighted gradients. It is integrated into GRPO and evaluated on LLaDA-8B-Instruct, Qwen3-8B, and Qwen3.5-9B across mathematical reasoning and planning benchmarks, claiming to match BF16 baseline performance while retaining the 1.5–2.76× FP8 rollout speedup.
Significance. If the adaptive correction proves robust across models and stages, the work offers a practical route to higher-throughput RL training for LLMs by safely exploiting low-precision rollouts. The per-batch interpolation mechanism is a targeted response to the non-stationary nature of the mismatch and could generalize to other precision or distribution-shift settings in policy optimization.
major comments (2)
- [Method] The central construction of the mixing coefficient from the three diagnostics is presented without a derivation, stability analysis, or sensitivity study; this is load-bearing for the claim that the interpolation reliably retains early exploration benefits while suppressing later bias (see abstract and method description).
- [Experiments] The empirical claim that AIS 'matches the BF16 baseline on most tasks' is stated without quantitative metrics, tables, error bars, or per-task breakdowns; the absence of these details prevents verification of whether the reported speedups come at any hidden performance cost (see abstract and evaluation section).
minor comments (1)
- [Abstract] The acronym GRPO is used without definition or citation on first appearance, which may hinder readability for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and commit to the indicated revisions.
Point-by-point responses
- Referee: [Method] The central construction of the mixing coefficient from the three diagnostics is presented without a derivation, stability analysis, or sensitivity study; this is load-bearing for the claim that the interpolation reliably retains early exploration benefits while suppressing later bias (see abstract and method description).
Authors: We agree that the mixing coefficient construction would benefit from additional formal support. In the revised manuscript we will add a derivation that motivates the per-batch interpolation from the three diagnostics (weight reliability, divergence severity, and variance amplification), a stability analysis establishing bounded variance of the resulting gradient estimator (a generic bound of this kind is sketched after these responses), and a sensitivity study that varies the relative weights of the diagnostics. These additions will directly substantiate the claim that early exploratory benefits are retained while later bias is suppressed. Revision: yes.
- Referee: [Experiments] The empirical claim that AIS 'matches the BF16 baseline on most tasks' is stated without quantitative metrics, tables, error bars, or per-task breakdowns; the absence of these details prevents verification of whether the reported speedups come at any hidden performance cost (see abstract and evaluation section).
Authors: We acknowledge that the current presentation lacks the quantitative detail needed for verification. The revised manuscript will include full tables reporting exact performance numbers (mean and standard deviation) for AIS, uncorrected FP8, and BF16 baselines on every task and model, together with error bars from multiple independent runs and explicit per-task breakdowns. These additions will allow direct assessment that the 1.5–2.76× rollout speedups incur no hidden performance cost. Revision: yes.
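For context on the first response's promised stability analysis, the inequality below is the generic reason truncating importance weights yields a bounded second moment; it is a textbook property of clipped importance sampling, not a result quoted from the paper.

```latex
% Generic bound for a clipped importance-sampling estimator (illustrative only, generic notation).
% Samples x ~ q (rollout policy), weights w(x) = p(x)/q(x) truncated at c > 0, integrand g(x):
\mathbb{E}_{q}\!\left[\bigl(\min\{w(x),\,c\}\, g(x)\bigr)^{2}\right]
\;\le\; c^{2}\, \mathbb{E}_{q}\!\left[g(x)^{2}\right],
% so the estimator's variance is finite whenever E_q[g^2] is, however heavy-tailed the
% untruncated weights are; the cost is a truncation bias of magnitude
\Bigl|\,\mathbb{E}_{q}\!\left[\bigl(w(x) - \min\{w(x),\,c\}\bigr)\, g(x)\right]\Bigr|.
```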
Circularity Check
No significant circularity detected in the AIS derivation
Full rationale
The paper defines AIS as a heuristic that computes a per-batch mixing coefficient directly from three real-time, independently observable diagnostics (weight reliability, divergence severity, variance amplification) to interpolate between uncorrected and importance-weighted gradients. This construction does not reduce any claimed prediction or result to a fitted parameter, self-citation, or input by definition; the diagnostics are extracted from the ongoing training process without reference to target performance metrics, and the interpolation rule is presented as an empirical design choice validated across models and benchmarks. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain, leaving the central claim self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: importance sampling can correct policy-gradient bias arising from rollout-training distribution mismatch (the standard identity is sketched just after this ledger).
invented entities (1)
- Per-batch mixing coefficient driven by weight reliability, divergence severity, and variance amplification (no independent evidence).
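As flagged in the axiom above, the correction rests on the standard importance-sampling identity for policy gradients, stated here in generic notation (not the paper's): with rollouts drawn from the low-precision policy q and the trainer policy p_theta,

```latex
% Standard importance-sampling identity behind the ledger's domain assumption (generic notation).
% Trajectories tau sampled from the rollout policy q give an unbiased estimate of the
% policy gradient under the trainer policy p_theta:
\mathbb{E}_{\tau \sim q}\!\left[\frac{p_{\theta}(\tau)}{q(\tau)}\, A(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\right]
= \mathbb{E}_{\tau \sim p_{\theta}}\!\left[A(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)\right],
% valid whenever q(tau) > 0 wherever p_theta(tau) > 0; the practical difficulty is the
% variance of the ratio p_theta/q, which is what the three diagnostics are meant to monitor.
```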
Reference graph
Works this paper leans on
- [1] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
- [2] DeepSeek-AI. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv preprint arXiv:2512.02556.
- [3] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. arXiv preprint arXiv:2505.24298.
- [4] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
- [5] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. arXiv preprint arXiv:2402.14008.
- [6] Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL. arXiv preprint arXiv:2508.18588.
- [7] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint arXiv:2103.03874.
- [8] Jian Hu, Xibin Wu, Weixun Zhu, Weihao Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. arXiv preprint arXiv:2405.11143.
- [9] Jiawei Liu and Lingming Zhang. Code-R1: Reproducing R1 for Code with Reliable Rewards. arXiv preprint arXiv:2503.18470.
- [10] Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models. arXiv preprint arXiv:2310.08041, 2023; Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. FlashRL: 8-bit Rollouts, Full Power RL, 2025; Zechun Liu, Barlas Oguz, C... (remainder of entry truncated in source).
- [11] Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica, and Tianjun Zhang. DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee... (URL truncated in source).
- [12] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A White Paper on Neural Network Quantization. arXiv preprint arXiv:2106.08295. Available: https://arxiv.org/abs/2106.08295
- [13] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large Language Diffusion Models. arXiv preprint arXiv:2502.09992.
- [14] Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 Large Language Models. arXiv preprint arXiv:2310.18313.
- [15] Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, and Junjie Lai. FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning. arXiv preprint arXiv:2601.18150.
- [16] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
- [17] SGLang RL Team. Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL. https://www.lmsys.org/blog/2025-11-25-fp8-rl/, 2025.
- [18] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
- [19] Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, et al. NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment. arXiv preprint arXiv:2405.01481.
- [20] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A Flexible and Efficient RLHF Framework. arXiv preprint arXiv:2409.19256.
- [21] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599.
- [22] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
- [23] Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your Efficient RL Framework Secretly Brings You Off-Policy RL Training, August 2025. URL https://fengyao.notion.site/off-policy-rl; Feng Yao et al. On the Rollout-Training Mismatch in Modern RL Systems, 2025; Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo... DAPO: An Open-Source LLM Reinforcement Learning System at Scale (author list truncated in source).
- [24] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks. arXiv preprint arXiv:2504.05118.
- [25] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277.
- [26] Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts. arXiv preprint arXiv:2506.02177.