pith. machine review for the scientific record.

arxiv: 2605.13907 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.AI · cs.LG

Recognition: no theorem link

AIS: Adaptive Importance Sampling for Quantized RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:09 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords adaptive importance sampling · quantized reinforcement learning · FP8 rollouts · policy gradient bias · LLM training · GRPO · non-stationary mismatch

The pith

Adaptive Importance Sampling corrects non-stationary bias from low-precision rollouts while keeping their speed gains in LLM RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that low-precision FP8 rollouts speed up generation in RL for large language models but create a rollout-training mismatch with the BF16 trainer. This mismatch helps exploration early in training by exposing the policy to under-sampled trajectories, yet it shifts into destabilizing bias as the policy concentrates and can collapse training on reasoning tasks. AIS addresses the issue by computing three real-time diagnostics—weight reliability, divergence severity, and variance amplification—and combining them into a single per-batch mixing coefficient. The coefficient interpolates between the uncorrected gradient and the fully importance-weighted gradient, suppressing bias while retaining the exploratory benefit. Experiments integrate AIS into GRPO and test it on diffusion and autoregressive models across mathematical and planning benchmarks, showing it matches the BF16 baseline while preserving the 1.5 to 2.76x rollout speedup.
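The interpolation at the heart of AIS can be sketched in a few lines. Everything below is an illustrative toy, not the paper's implementation: the dimensions, the truncation threshold, and the synthetic log-probabilities are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sequence log-probs under the BF16 trainer policy and the
# FP8 rollout policy, plus advantages and per-sequence gradient contributions.
logp_train = rng.normal(-10.0, 1.0, size=8)
logp_rollout = logp_train + rng.normal(0.0, 0.1, size=8)  # small FP8 mismatch
advantages = rng.normal(size=8)
per_seq_grads = rng.normal(size=(8, 4))  # toy 4-dimensional gradient per sequence

# Importance weights of the trainer policy w.r.t. the rollout policy.
w = np.exp(logp_train - logp_rollout)

# The uncorrected estimator treats rollout samples as on-policy; the corrected
# one reweights each sequence by w (truncated for stability, in the spirit of
# TIS-style corrections; the threshold 2.0 is an assumption).
g_uncorrected = (advantages[:, None] * per_seq_grads).mean(axis=0)
w_trunc = np.minimum(w, 2.0)
g_corrected = (w_trunc[:, None] * advantages[:, None] * per_seq_grads).mean(axis=0)

def mix(alpha):
    """Per-batch interpolation: alpha=0 keeps the exploratory uncorrected
    gradient, alpha=1 applies the full importance-sampling correction."""
    return (1.0 - alpha) * g_uncorrected + alpha * g_corrected

assert np.allclose(mix(0.0), g_uncorrected)
assert np.allclose(mix(1.0), g_corrected)
```

In this sketch `alpha` is the per-batch mixing coefficient; AIS's contribution is choosing it adaptively from the three diagnostics rather than fixing it.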

Core claim

The rollout-training mismatch is non-stationary and double-edged: early it supplies a stochastic exploration bonus, but later it injects bias that risks collapse; AIS resolves this by forming a per-batch mixing coefficient from weight reliability, divergence severity, and variance amplification that smoothly transitions between uncorrected and importance-weighted gradients.

What carries the argument

The per-batch mixing coefficient, built from three real-time diagnostics (weight reliability, divergence severity, and variance amplification), which controls how strongly the gradient is interpolated between its uncorrected and fully importance-weighted forms.

If this is right

  • FP8 rollouts deliver 1.5 to 2.76x speedup without performance loss on most mathematical and planning tasks.
  • Training remains stable on both diffusion-based and autoregressive models when AIS is integrated into GRPO.
  • The same per-batch correction mechanism prevents outright collapse that otherwise occurs with uncorrected low-precision rollouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The diagnostic combination could be reused for other quantization schemes or sources of non-stationary bias in policy gradients.
  • If the mixing coefficient generalizes, it would reduce the need for full-precision rollouts in large-scale LLM RL pipelines.
  • Monitoring the three diagnostics separately might reveal new ways to detect when exploration has turned into harmful bias.

Load-bearing premise

The three real-time diagnostics can be combined into one mixing coefficient that reliably preserves early exploration benefits while suppressing later destabilizing bias across different models, tasks, and training stages without introducing new instabilities.
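This premise can be made concrete with a toy monitor built from batch-level importance weights. The specific formulas below (effective sample size for weight reliability, a KL estimate for divergence severity, weight variance for amplification, and an exponential squashing into one coefficient) are hedged guesses at what such a monitor could look like; the paper's actual definitions and gains may differ.

```python
import numpy as np

def diagnostics(logw):
    """Three batch-level diagnostics of rollout-training mismatch, computed
    from per-sequence log importance weights. Illustrative definitions only."""
    w = np.exp(logw - logw.max())          # stabilized importance weights
    w = w / w.mean()                       # self-normalized to mean 1
    n = len(w)
    # Weight reliability: effective-sample-size fraction in (0, 1].
    ess = float((w.sum() ** 2) / (n * (w ** 2).sum()))
    # Divergence severity: sample estimate of KL(rollout || train) >= 0.
    kl = float(-(np.log(np.maximum(w, 1e-12))).mean())
    # Variance amplification: variance of the normalized weights.
    var_amp = float(w.var())
    return ess, kl, var_amp

def mixing_coefficient(logw, k_ess=1.0, k_kl=5.0, k_var=1.0):
    """Collapse the diagnostics into one alpha in [0, 1); the gains k_* and
    the squashing form are placeholders, not the paper's fitted values."""
    ess, kl, var_amp = diagnostics(logw)
    risk = k_ess * (1.0 - ess) + k_kl * kl + k_var * var_amp
    return float(1.0 - np.exp(-risk))      # more mismatch -> stronger correction
```

With no mismatch (all log-weights zero) this monitor returns alpha = 0 and leaves the exploratory gradient untouched; as the weights disperse, alpha rises toward 1 and the correction takes over.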

What would settle it

A training run on a held-out reasoning benchmark where the policy still collapses under FP8 rollouts even after AIS applies its mixing coefficient.

Figures

Figures reproduced from arXiv: 2605.13907 by Jiajun Zhou, Lingchao Zheng, Ngai Wong, Wei Shao, Yuwei Fan.

Figure 1: Reward trajectory of Qwen3-8B on GSM8K under FP8 rollout.
Figure 2: Quantized RL training pipeline. Modern large-scale RL pipelines for LLMs decouple rollout generation from policy optimization to maximize hardware utilization [Espeholt et al., 2018].
Figure 3: Variation analysis during LLaDA-8B training on GSM8K. While TIS stabilizes training over uncorrected FP8 rollout, applying the same correction strength to every batch is limited.
Figure 4: LLaDA-8B training behavior under different low-bit rollout strategies on GSM8K.
Figure 5: End-to-end rollout speedup from FP8 quantization. Left: LLaDA-8B-Instruct across five …
Figure 6: Train–rollout mismatch on GSM8K across four configurations. To understand why AIS not only recovers but surpasses the BF16 baseline, the training dynamics are analyzed through two diagnostic lenses. Mismatch suppression …
Figure 7: Training reward on Countdown (left) and MATH (right). FP8 Rollout (red) exhibits …
Original abstract

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Adaptive Importance Sampling (AIS) to address the non-stationary rollout-training mismatch in quantized RL for LLMs, where FP8 rollouts paired with BF16 trainers introduce bias that aids early exploration but risks later collapse. AIS computes a per-batch mixing coefficient from three real-time diagnostics (weight reliability, divergence severity, variance amplification) to interpolate between uncorrected and importance-weighted gradients. It is integrated into GRPO and evaluated on LLaDA-8B-Instruct, Qwen3-8B, and Qwen3.5-9B across mathematical reasoning and planning benchmarks, claiming to match BF16 baseline performance while retaining the 1.5–2.76× FP8 rollout speedup.

Significance. If the adaptive correction proves robust across models and stages, the work offers a practical route to higher-throughput RL training for LLMs by safely exploiting low-precision rollouts. The per-batch interpolation mechanism is a targeted response to the non-stationary nature of the mismatch and could generalize to other precision or distribution-shift settings in policy optimization.

major comments (2)
  1. [Method] The central construction of the mixing coefficient from the three diagnostics is presented without a derivation, stability analysis, or sensitivity study; this is load-bearing for the claim that the interpolation reliably retains early exploration benefits while suppressing later bias (see abstract and method description).
  2. [Experiments] The empirical claim that AIS 'matches the BF16 baseline on most tasks' is stated without quantitative metrics, tables, error bars, or per-task breakdowns; the absence of these details prevents verification of whether the reported speedups come at any hidden performance cost (see abstract and evaluation section).
minor comments (1)
  1. [Abstract] The acronym GRPO is used without definition or citation on first appearance, which may hinder readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [Method] The central construction of the mixing coefficient from the three diagnostics is presented without a derivation, stability analysis, or sensitivity study; this is load-bearing for the claim that the interpolation reliably retains early exploration benefits while suppressing later bias (see abstract and method description).

    Authors: We agree that the mixing coefficient construction would benefit from additional formal support. In the revised manuscript we will add a derivation that motivates the per-batch interpolation from the three diagnostics (weight reliability, divergence severity, and variance amplification), a stability analysis establishing bounded variance of the resulting gradient estimator, and a sensitivity study that varies the relative weights of the diagnostics. These additions will directly substantiate the claim that early exploratory benefits are retained while later bias is suppressed. revision: yes

  2. Referee: [Experiments] The empirical claim that AIS 'matches the BF16 baseline on most tasks' is stated without quantitative metrics, tables, error bars, or per-task breakdowns; the absence of these details prevents verification of whether the reported speedups come at any hidden performance cost (see abstract and evaluation section).

    Authors: We acknowledge that the current presentation lacks the quantitative detail needed for verification. The revised manuscript will include full tables reporting exact performance numbers (mean and standard deviation) for AIS, uncorrected FP8, and BF16 baselines on every task and model, together with error bars from multiple independent runs and explicit per-task breakdowns. These additions will allow direct assessment that the 1.5–2.76× rollout speedups incur no hidden performance cost. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the AIS derivation

Full rationale

The paper defines AIS as a heuristic that computes a per-batch mixing coefficient directly from three real-time, independently observable diagnostics (weight reliability, divergence severity, variance amplification) to interpolate between uncorrected and importance-weighted gradients. This construction does not reduce any claimed prediction or result to a fitted parameter, self-citation, or input by definition; the diagnostics are extracted from the ongoing training process without reference to target performance metrics, and the interpolation rule is presented as an empirical design choice validated across models and benchmarks. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain, leaving the central claim self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard importance-sampling assumptions plus the new claim that three diagnostics suffice to detect and correct the non-stationary transition; no explicit free parameters are stated.

axioms (1)
  • domain assumption: Importance sampling can correct policy-gradient bias arising from rollout-training distribution mismatch
    Core premise of the AIS correction
invented entities (1)
  • Per-batch mixing coefficient driven by weight reliability, divergence severity, and variance amplification (no independent evidence)
    purpose: To interpolate between uncorrected and fully corrected gradients
    New control variable introduced to handle non-stationarity
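The single axiom is the standard importance-sampling identity: under the usual support condition (the rollout policy puts positive probability wherever the trainer policy does), reweighting rollout trajectories by the likelihood ratio recovers the on-policy gradient in expectation. A sketch, with A(τ) denoting the advantage of trajectory τ:

```latex
% Importance-sampling identity behind the correction: expectations under the
% FP8 rollout policy recover the on-policy gradient once each trajectory is
% reweighted by w(\tau).
\mathbb{E}_{\tau \sim \pi_{\text{rollout}}}\!\left[
    w(\tau)\, A(\tau)\, \nabla_\theta \log \pi_{\text{train}}(\tau)
\right]
= \mathbb{E}_{\tau \sim \pi_{\text{train}}}\!\left[
    A(\tau)\, \nabla_\theta \log \pi_{\text{train}}(\tau)
\right],
\qquad
w(\tau) = \frac{\pi_{\text{train}}(\tau)}{\pi_{\text{rollout}}(\tau)}.
```

The AIS-specific claim layered on top is that the invented mixing coefficient can trade this unbiasedness off against the variance the weights introduce.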

pith-pipeline@v0.9.0 · 5567 in / 1326 out tokens · 88529 ms · 2026-05-15T03:09:29.288537+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 15 internal anchors

  1. [1]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  2. [2]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.

  3. [3]

    AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  5. [5]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

  6. [6]

    History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL

    Jingkai He, Tianjian Li, Erhu Feng, Dong Du, Qian Liu, Tao Liu, Yubin Xia, and Haibo Chen. History rhymes: Accelerating LLM reinforcement learning with RhymeRL. arXiv preprint arXiv:2508.18588.

  7. [7]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  8. [8]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Weixun Zhu, Weihao Wang, Dehao Zhang, and Yu Cao. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143.

  9. [9]

    Code-R1: Reproducing R1 for Code with Reliable Rewards

    Jiawei Liu and Lingming Zhang. Code-R1: Reproducing R1 for code with reliable rewards. arXiv preprint arXiv:2503.18470.

  10. [10]

    QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

    Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. QLLM: Accurate and efficient low-bitwidth quantization for large language models. arXiv preprint arXiv:2310.08041, 2023a.

  11. [11]

    Deepcoder: A fully open-source 14b coder at o3-mini level

    Michael Luo, Sijun Tan, Roy Huang, Xiaoxiang Shi, Rachel Xin, Colin Cai, Ameen Patel, Alpay Ariyak, Qingyang Wu, Ce Zhang, Li Erran Li, Raluca Ada Popa, Ion Stoica, and Tianjun Zhang. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/ DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee...

  12. [12]

    A White Paper on Neural Network Quantization

    Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295.

  13. [13]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992.

  14. [14]

    FP8-LM: Training FP8 Large Language Models

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, and Peng Cheng. FP8-LM: Training FP8 large language models. arXiv preprint arXiv:2310.18313.

  15. [15]

    FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

    Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, and Junjie Lai. FP8-RL: A practical and stable low-precision stack for LLM reinforcement learning. arXiv preprint arXiv:2601.18150.

  16. [16]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  17. [17]

    Unified FP8: Moving beyond mixed precision for stable and accelerated MoE RL

    SGLang RL Team. Unified FP8: Moving beyond mixed precision for stable and accelerated MoE RL. https://www.lmsys.org/blog/2025-11-25-fp8-rl/

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  19. [19]

    NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment

    Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi, et al. NeMo-Aligner: Scalable toolkit for efficient model alignment. arXiv preprint arXiv:2405.01481.

  20. [20]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

  21. [21]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599.

  22. [22]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An open-source LLM reinforcement learning system at scale.

  24. [24]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.

  25. [25]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

  26. [26]

    Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

    Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177.

  27. [27]

    For INT8 quantization, the main challenge is activation outliers

    Appendix B, Related Work. Quantization of LLMs. Quantization is a standard technique for compressing large language models (LLMs) [Gholami et al., 2022, Nagel et al., 2021], with 8-bit formats offering a favorable trade-off between compression and accuracy under native hardware support. For INT8 quantization, the main challenge is activation outliers. …

  28. [28]

    to ensure direct comparability, extending it with an FP8 rollout engine and the AIS correction module. Our Qwen experiments adapt the same pipeline to autoregressive architectures and incorporate evaluation components from DeepScaler [Luo et al., 2025b] and lm-evaluation-harness [Gao et al., 2021]. Full implementation details, including configuration …

  29. [29]

    Training modes. We evaluate under two regimes to probe robustness across scaling strategies: LoRA [Hu et al., 2022]: r = 128 for LLaDA-8B; r = 64 for Qwen3.5-9B

    Framework. Our codebase extends the TRL library [von Werra et al., 2022] with dedicated trainers for mismatched-precision RL, supporting both a BF16 baseline (rollout and learner both in BF16) and an FP8-rollout configuration paired with the AIS correction module. Training modes. We evaluate under two regimes to probe robustness across scaling strategies: …

  30. [30]

    Table 3: Hyperparameter settings for GRPO training across tasks and training modes.

    Hyperparameter          Full-parameter FT    LoRA (Qwen3.5-9B)    LoRA (LLaDA-8B)
    Learning rate           1×10⁻⁶ to 1×10⁻⁵     1×10⁻⁷ to 1×10⁻⁶     1×10⁻⁶ to 3×10⁻⁶
    PPO clipping (ε)        0.2                  0.2                  0.2
    Number of generations   8                    8                    8
    Max prompt length       256                  256                  256
    Max completion length   256–512              256–512              256–1024
    Effective …

  31. [31]

    to align with the LLaDA masked-prediction objective. We use a composite reward combining (i) a format reward for proper XML tagging (<reasoning>, <answer>) and answer delimiters, and (ii) a correctness reward (exact match for GSM8K, boxed-answer equivalence for MATH500, and valid arithmetic verification for Countdown). Deviation from the reference recipe. The …

  32. [32]

    Throughout this analysis, expectations E_rollout and E_train are taken with respect to π_rollout and π_train respectively, and ‖·‖ denotes the Euclidean norm

    We (i) derive the oracle mixing coefficient α⋆ under a surrogate mean-squared-error criterion (Proposition 1), (ii) establish a bounded second moment for the truncated AIS estimator (Proposition 2), and (iii) demonstrate that AIS exactly recovers the on-policy gradient when rollout-training mismatch is absent (Proposition 3). Throughout this analysis, ex…

  33. [33]

    We hypothesize that the stochastic perturbations introduced by FP8 quantization act as an implicit exploration bonus, driving the policy into low-probability regions of the trajectory space. When these trajectories are appropriately re-weighted by AIS, the resulting gradient benefits from a broader exploration landscape while remaining approximately unbiased …