DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

Dayiheng Liu; Huiqiang Jiang; Linfeng Zhang; Shaobo Wang; Sihang Li; Wenjie Qiu; Xingzhang Ren; Xuming Hu; Yafeng Sun; Yucheng Li

arxiv: 2605.17295 · v1 · pith:NHFCSFEZnew · submitted 2026-05-17 · 💻 cs.LG · cs.CL

Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang

Pith reviewed 2026-05-20 15:06 UTC · model grok-4.3

keywords distribution-matching RL · importance sampling · partition function · offline estimation · LLM reinforcement learning · policy optimization · decoupling · trajectory sampling

The pith

DISA estimates the prompt-dependent partition function offline with importance sampling on proposal trajectories and freezes the estimate before policy optimization begins in distribution-matching LLM-RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a weakness of online-coupled distribution-matching RL for language models: partition-function calibration is tied directly to policy updates, so estimation errors distort learning and cannot be checked separately. DISA instead draws trajectories from a fixed proposal distribution ahead of time, estimates the partition function by importance sampling, and locks that value in place before policy optimization starts. The decoupling keeps the core goal intact, namely allocating probability across the full set of reward-shaped solutions, while making the estimation and learning steps independent in their data sources, gradients, loss terms, and diagnostic checks. A reader would care because this separation can improve stability and allows clearer inspection of whether the calibration step succeeded on its own. Experiments on math and code tasks with open-weight models show the method matching or beating the online-coupled baseline while producing more diverse solution strategies than pure reward maximization.

Core claim

DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins, preserving the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics.
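
As a rough illustration of the offline step, the sketch below computes an importance-sampled estimate of log Z(q) for a single prompt. It assumes the target π_ref · e^{βr}/Z(q) named in the Figure 1 caption, so each importance weight is π_ref(o|q) · e^{βr(q,o)} / p_T(o|q). The function names `logsumexp` and `log_z_is` and the argument layout are hypothetical and are not taken from the paper.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_z_is(rewards, logp_ref, logp_prop, beta):
    """Importance-sampled estimate of log Z(q) from N proposal trajectories.

    Assumed weight form (not confirmed by this page):
        log w_i = beta * r_i + log pi_ref(o_i | q) - log p_T(o_i | q)
    and log Z_hat = logsumexp(log w) - log N.
    """
    log_w = [
        beta * r + lr - lp
        for r, lr, lp in zip(rewards, logp_ref, logp_prop)
    ]
    return logsumexp(log_w) - math.log(len(log_w))


# Toy usage: three stored trajectories for one prompt (made-up numbers).
estimate = log_z_is(
    rewards=[1.0, 0.0, 0.0],
    logp_ref=[-1.0, -1.0, -1.0],
    logp_prop=[-1.0, -1.0, -1.0],
    beta=1.0,
)
```

Because the value depends only on stored trajectories, it can be computed once per prompt, cached, and inspected before any policy update runs.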

What carries the argument

The frozen importance-sampled estimate of the prompt-dependent partition function, which decouples calibration from the subsequent policy updates.

If this is right

On two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL.
DISA outperforms reward-maximization baselines GRPO and GSPO on math averages.
DISA exceeds LoRASFT distillation by up to 13.8 Mean@8 points when using the same offline trajectories.
An LLM-as-judge evaluation shows DISA retains substantially more strategy-level diversity than reward-maximization baselines.
Sensitivity studies on proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The offline nature of the estimation step could reduce training-time compute by avoiding repeated partition-function calculations during policy updates.
Separate diagnostics for the frozen estimate might make it easier to diagnose whether poor final performance stems from bad calibration or from the optimization process itself.
Similar offline anchoring could be tested in other sequential generation settings that require matching a target distribution over multiple valid outputs.
The bias-variance tradeoff observed in the sensitivity studies suggests that proposal quality remains a key lever even after the estimate is frozen.

Load-bearing premise

Offline trajectories from a fixed proposal distribution are representative enough to yield an accurate low-variance importance-sampling estimate of the partition function that can be frozen without distorting later policy updates.

What would settle it

A direct comparison showing that policy optimization with the frozen offline estimate produces measurably different solution distributions or lower performance than an otherwise identical run that estimates the partition function online would indicate the decoupling fails to preserve the objective.

Figures

Figures reproduced from arXiv: 2605.17295 by Dayiheng Liu, Huiqiang Jiang, Linfeng Zhang, Shaobo Wang, Sihang Li, Wenjie Qiu, Xingzhang Ren, Xuming Hu, Yafeng Sun, Yucheng Li, Yujie Chen, Zhihui Xie.

**Figure 1.** Two families of post-training and the gradient-backflow asymmetry that distinguishes them. (a) On a multi-modal reward r(q, ·), reward maximization (PPO/GRPO/GSPO) collapses to a single mode, while distribution matching targets π_ref · e^{βr}/Z(q) and preserves multi-modal structure. (b) Prior methods co-train the partition function Z_ϕ with π_θ on a shared loss, so partition-function error rides into ∇_θ. DISA rep…

**Figure 2.** The pipeline of DISA, which includes three stages. Stage 1, offline IS estimation: for each prompt q, draw N trajectories from a proposal p_T and form the per-prompt label log Ẑ_IS(q) via the logsumexp aggregator. Stage 2, amortization: fit a regressor g_ψ to these labels by least squares. Stage 3, anchored RL: freeze g_ψ⋆ and substitute it for log Z_ϕ in the trajectory-balance loss, so the partition function …
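
To make Stages 2 and 3 of the caption concrete, here is a deliberately small sketch: it fits a regressor to per-prompt log Ẑ labels by least squares and then wraps the fitted parameters in a read-only predictor. The paper's actual regressor and prompt features are not described on this page; the single scalar feature, the linear model, and the names `fit_least_squares` and `freeze` are all assumptions made for illustration.

```python
def fit_least_squares(features, labels):
    """Closed-form least squares for label ≈ a * feature + b.

    `features` is a hypothetical scalar summary of each prompt;
    `labels` are the per-prompt log Z-hat values from Stage 1.
    """
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def freeze(params):
    """Return a predictor whose parameters are never updated again."""
    a, b = params  # captured by value; the RL stage only reads them

    def g(feature):
        return a * feature + b

    return g


# Toy usage with made-up labels.
g_frozen = freeze(fit_least_squares([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
anchor = g_frozen(3.0)  # value substituted for log Z during policy training
```

The point of the closure is only to show the separation: once `freeze` returns, nothing in the policy-training loop can change the anchor.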

**Figure 3.** Sensitivity to the proposal model p_T. p_T=235B: default Qwen3-235B-A22B-Instruct-2507; p_T=4B: weaker Qwen3-4B-Instruct-2507. (a) DISA on Qwen3-4B-Base under the two proposals, Mean@8 across six math benchmarks. (b) Validation MSE of the Stage 2 regressor g_ψ across training epochs under the two proposals; stars mark the selected checkpoints. […] and that the offline-then-online split of DISA recovers the FlowRL …

**Figure 4.** Sensitivity to the inverse-temperature β on Qwen2.5-7B code. DISA at β ∈ {10, 15, 20}, pass@1 on the left and pass@16 on the right, with all other hyperparameters held fixed. The default β = 15, hatched and starred, matches FlowRL; the untrained backbone is shown in gold.
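
This page does not say how pass@1, pass@16, or Mean@8 are computed. One common convention, assumed here, is the unbiased pass@k estimator introduced with HumanEval (1 − C(n−c, k)/C(n, k) for n samples of which c are correct), with Mean@8 read as the average correctness over 8 samples per problem. The function names below are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_at_n(correct_flags):
    """Average correctness over the samples drawn for one problem."""
    return sum(correct_flags) / len(correct_flags)


# Toy usage: 16 samples with 4 correct, and 8 samples with 3 correct.
p1 = pass_at_k(16, 4, 1)
p16 = pass_at_k(16, 4, 16)
m8 = mean_at_n([1, 0, 0, 1, 1, 0, 0, 0])
```

Benchmark-level numbers would then be averages of these per-problem values.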

**Figure 5.** Cluster-count selection for the variance–bias subset. SSE of k-means clustering on math-prompt embeddings versus the number of clusters k. The highlighted elbow at k = 8 is the cluster count used for stratified sampling of the ∼500-prompt subset. […] as M varies: the empirical variance of log Ẑ_M(q) across subsamples, and the per-prompt relative bias |Ẑ_M(q) − Ẑ_32(q)|/Ẑ_32(q). Results …

**Figure 6.** Variance and relative bias of the offline log Z estimator versus M. Top panels: mean variance of log Ẑ_M(q) across repeated subsamples, with ± standard deviation on the left and the smoothed trend on the right. Bottom panels: mean per-prompt relative bias |Ẑ_M − Ẑ_32|/Ẑ_32, with ± standard deviation on the left and the smoothed trend on the right. Both curves exhibit a clear elbow around M = 8, with dimini…
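
The two diagnostics in this caption can be approximated by repeated subsampling of the stored log importance weights for one prompt. The sketch below is one plausible reading, not the paper's procedure: it draws M of the full set of weights without replacement, recomputes log Ẑ_M each time, and reports the sample variance of log Ẑ_M together with the mean of |Ẑ_M − Ẑ_full|/Ẑ_full. The function names, the repetition count, and the seeding are assumptions.

```python
import math
import random


def log_z_hat(log_w):
    """log of the mean importance weight, computed stably."""
    m = max(log_w)
    return m + math.log(sum(math.exp(x - m) for x in log_w)) - math.log(len(log_w))


def variance_and_rel_bias(log_w_full, m, n_rep=200, seed=0):
    """Subsample m weights n_rep times; return (Var[log Z_m], mean relative bias)."""
    rng = random.Random(seed)
    ref = log_z_hat(log_w_full)  # full-budget reference, e.g. 32 rollouts
    ests = [log_z_hat(rng.sample(log_w_full, m)) for _ in range(n_rep)]
    mean = sum(ests) / n_rep
    var = sum((e - mean) ** 2 for e in ests) / (n_rep - 1)
    # |Z_m - Z_full| / Z_full equals |exp(log Z_m - log Z_full) - 1|
    rel_bias = sum(abs(math.exp(e - ref) - 1.0) for e in ests) / n_rep
    return var, rel_bias


# Toy usage: 32 made-up log-weights, evaluated at M = 4.
toy_log_w = [0.0, 1.0, 2.0, 3.0] * 8
var_4, bias_4 = variance_and_rel_bias(toy_log_w, 4)
```

Sweeping `m` over several values and plotting both outputs would give curves of the kind the caption describes.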

**Figure 7.** Distributions of variance and relative bias by M. Boxplots of Var(log Ẑ_M) in the left panel and the relative bias |Ẑ_M − Ẑ_32|/Ẑ_32 in the right panel across the ∼500 subset prompts. Dispersion narrows sharply between M = 2 and M = 8 and only marginally between M = 8 and M = 16; the extreme outliers visible at M ≤ 4 have largely disappeared by M = 8. […] M = 16. We therefore set the offline rollout count to …

**Figure 4.**
• β = 10 vs. default β = 15. Average pass@1 drops slightly from 31.80 to 30.91, with β = 10 in fact slightly exceeding the default on LiveCodeBench and CodeForces; average pass@16 drops by 5.4 points, from 45.3 to 39.9.
• β = 20 vs. default β = 15. Average pass@1 collapses to 13.75, an ∼18-point drop driven mainly by HumanEval+, where it falls from 71.2 to 24.1. Average pass@16 falls to 28.3, only +2.5 ove…

Abstract

Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DISA moves partition function estimation offline via importance sampling and freezes it before policy updates, which is a practical separation but leaves open whether the estimator stays accurate enough in peaked LLM spaces.

The core move here is pulling the prompt-dependent partition function out of the RL loop entirely. They sample trajectories from a fixed proposal offline, compute the importance-sampled estimate of Z, and then treat that number as a constant during policy optimization. This keeps the distribution-matching objective on paper while splitting the data, gradients, and diagnostics cleanly from the learning step. That separation is the actual novelty relative to online-coupled approaches like FlowRL.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DISA (Decoupled Importance-Sampled Anchoring) for distribution-matching RL in LLMs. It draws proposal trajectories offline from a fixed distribution, estimates the prompt-dependent partition function Z via importance sampling, freezes the resulting estimate, and then performs separate policy optimization. The central claim is that this decoupling preserves the exact distribution-matching objective while cleanly separating estimation from learning in data, gradients, loss, and diagnostics. Experiments across math and code benchmarks on open-weight models report performance matching or exceeding the online-coupled baseline FlowRL, outperforming reward-maximization methods like GRPO and GSPO on math averages, and retaining higher strategy-level diversity.

Significance. If the offline IS estimates of Z prove sufficiently low-variance and unbiased, the decoupling would enable independent calibration and diagnostics for distribution-matching objectives, mitigating the mode-collapse issues of standard reward-max RL while reducing online computational overhead. The reported diversity gains and sensitivity studies aligned with bias-variance predictions would strengthen the case for practical adoption in reasoning agents.

major comments (2)

[§3.2, Eq. (7)] The claim that freezing the IS estimate of the prompt-dependent partition function Z preserves the exact distribution-matching objective is load-bearing for the decoupling result, yet no variance bound or concentration analysis is given for the estimator weights exp(r(τ))/q(τ) when q is a base model or SFT checkpoint that assigns near-zero mass to rare high-reward trajectories.
[Table 2 and §5.3] The reported Mean@8 gains over LoRASFT and FlowRL lack error bars, number of seeds, or exact prompt sampling protocols, so it is impossible to determine whether the observed improvements are statistically distinguishable from the variance that would arise from an inaccurate frozen Z.

minor comments (2)

[§2] The notation for the proposal strength and inverse temperature parameters is introduced without an explicit table of symbols, making cross-references between the analysis and experiments harder to follow.
[Figure 3] Figure 3 caption should state the exact number of offline proposal trajectories used per prompt so that readers can assess the IS sample size directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.

Referee: [§3.2, Eq. (7)] The claim that freezing the IS estimate of the prompt-dependent partition function Z preserves the exact distribution-matching objective is load-bearing for the decoupling result, yet no variance bound or concentration analysis is given for the estimator weights exp(r(τ))/q(τ) when q is a base model or SFT checkpoint that assigns near-zero mass to rare high-reward trajectories.

Authors: We thank the referee for identifying this key theoretical point. The importance-sampling estimator for the prompt-dependent partition function Z is unbiased by construction, since its expectation under trajectories drawn from the proposal q recovers the true Z exactly. Consequently, the distribution-matching objective is preserved in expectation when the estimate is frozen. We acknowledge, however, that the manuscript provides no explicit variance bound or concentration result for the weights exp(r(τ))/q(τ), particularly when q is a base model or SFT checkpoint with limited mass on rare high-reward trajectories. This is a fair observation. In the revised manuscript we will add a short discussion in §3.2 clarifying that the objective is preserved in expectation, together with a qualitative analysis of the bias-variance tradeoff and a simple variance bound under the mild assumption that the proposal has positive (though possibly small) coverage of the relevant support. We will also report the effective sample size of the importance weights for each benchmark to give readers a practical sense of estimator stability. This is a partial revision: we strengthen the presentation and add supporting analysis without claiming a new tight concentration inequality. revision: partial
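
The effective sample size mentioned in this response is usually computed with Kish's formula, ESS = (Σ w)² / Σ w². The rebuttal does not state which definition the authors intend, so the sketch below is an assumption; it works from log-weights to avoid overflow, and the function name is hypothetical.

```python
import math


def effective_sample_size(log_w):
    """Kish effective sample size from log importance weights.

    Shifting by the maximum log-weight leaves the ratio unchanged
    and keeps the exponentials in a safe range.
    """
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    total_sq = sum(x * x for x in w)
    return total * total / total_sq


# Toy usage: uniform weights versus one dominant weight.
ess_uniform = effective_sample_size([0.0] * 8)
ess_peaked = effective_sample_size([0.0, -1000.0, -1000.0])
```

An ESS close to the number of trajectories suggests well-spread weights; an ESS near 1 means a single trajectory dominates the estimate, which is the regime the referee is worried about.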
Referee: [Table 2 and §5.3] The reported Mean@8 gains over LoRASFT and FlowRL lack error bars, number of seeds, or exact prompt sampling protocols, so it is impossible to determine whether the observed improvements are statistically distinguishable from the variance that would arise from an inaccurate frozen Z.

Authors: We agree that the absence of error bars, seed counts, and precise sampling protocols limits the ability to assess statistical distinguishability of the reported gains from variance attributable to the frozen Z estimate. This was an oversight in the original submission. In the revised version we will (i) recompute and report all Mean@8 numbers in Table 2 with standard-error bars obtained from five independent random seeds, (ii) add an explicit description in §5.3 of the prompt-sampling protocol (including how prompts were drawn from each benchmark and the exact number of prompts per task), and (iii) include a brief sensitivity paragraph discussing how moderate perturbations to the frozen Z affect downstream performance. These additions will allow readers to evaluate whether the observed improvements exceed the combined variance from both the Z estimator and training stochasticity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via methodological decoupling

The paper presents DISA as a decoupling of offline importance sampling for estimating the prompt-dependent partition function Z from subsequent policy optimization, with the estimate frozen before RL begins. No equations or steps in the provided abstract or description reduce the central claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation. The distribution-matching objective is preserved by construction of the separation in data/gradients/loss, and empirical results on benchmarks provide independent content. This is the common case of an honest non-finding.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard importance-sampling and RL assumptions plus the new DISA procedure; two tunable quantities are highlighted in sensitivity studies.

free parameters (2)

proposal strength
Appears in sensitivity studies as a control on the offline proposal distribution used for importance sampling.
inverse temperature
Controls reward shaping in the distribution-matching objective and is varied in sensitivity analyses.

axioms (1)

domain assumption: Importance sampling from offline proposal trajectories can produce a usable estimate of the prompt-dependent partition function.
This assumption enables freezing the estimate before policy optimization begins.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: the trajectory-balance (TB) loss L_TB(θ, ϕ) = E[(log Z_ϕ(q) + log π_θ(o|q) − log π_ref(o|q) − β r̃(q,o))²]
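
For readers who want to see the quoted trajectory-balance loss with a frozen partition-function estimate, the following is a minimal sketch. It evaluates the squared residual from the formula above on a small batch and returns the gradient with respect to each log π_θ(o|q) only; the log Z value is treated as a constant, which is how "frozen" is read here. The batch layout and function names are hypothetical, and a real implementation would differentiate through the policy network instead.

```python
def tb_residual(log_z, logp_theta, logp_ref, reward, beta):
    """Residual inside the square of the TB loss for one (q, o) pair."""
    return log_z + logp_theta - logp_ref - beta * reward


def tb_loss_frozen_z(batch, beta):
    """Mean squared TB residual and its gradient w.r.t. each log pi_theta.

    batch: list of (log_z_frozen, logp_theta, logp_ref, reward) tuples.
    log_z_frozen is a constant: no gradient is returned for it.
    """
    residuals = [tb_residual(z, lt, lr, r, beta) for z, lt, lr, r in batch]
    n = len(residuals)
    loss = sum(d * d for d in residuals) / n
    grad_logp_theta = [2.0 * d / n for d in residuals]
    return loss, grad_logp_theta


# Toy usage: the first sample already satisfies the balance condition,
# the second puts too much probability on its trajectory.
toy_batch = [
    (0.5, -1.0, -2.0, 1.0),
    (0.5, -0.5, -2.0, 1.0),
]
loss, grads = tb_loss_frozen_z(toy_batch, beta=1.5)
```

A zero residual corresponds to log π_θ(o|q) = log π_ref(o|q) + β r̃(q,o) − log Z(q), that is, the policy sitting exactly on the reward-shaped target for that trajectory.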

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page arXiv

[16] [16]

Agentic Reinforced Policy Optimization

Agentic reinforced policy optimization , author=. arXiv preprint arXiv:2507.19849 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

arXiv preprint arXiv:2507.12507 , year=

Scaling up rl: Unlocking diverse reasoning in llms via prolonged training , author=. arXiv preprint arXiv:2507.12507 , year=

work page arXiv

[18] [18]

arXiv preprint arXiv:2505.09655 , year=

Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero-like training of large language models , author=. arXiv preprint arXiv:2505.09655 , year=

work page arXiv

[19] [19]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

The entropy mechanism of reinforcement learning for reasoning language models , author=. arXiv preprint arXiv:2505.22617 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Reasoning with exploration: An entropy perspective on reinforcement learning for llms, 2025 , author=

work page 2025

[22] [22]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

work page

[23] [23]

International Conference on Artificial Intelligence and Statistics , pages=

A general theoretical paradigm to understand learning from human preferences , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

work page 2024

[24] [24]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

work page

[25] [25]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback , author=. arXiv preprint arXiv:2204.05862 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Raft: Reward ranked finetuning for generative foundation model alignment , author=. arXiv preprint arXiv:2304.06767 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2407.14622 , year=

Bond: Aligning llms with best-of-n distillation , author=. arXiv preprint arXiv:2407.14622 , year=

work page arXiv

[28] [28]

Advances in neural information processing systems , volume=

Implicit generation and modeling with energy based models , author=. Advances in neural information processing systems , volume=

work page

[29] [29]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page

[30] [30]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

work page

[32] [32]

arXiv preprint arXiv:2408.06195 , year=

Mutual reasoning makes smaller llms stronger problem-solvers , author=. arXiv preprint arXiv:2408.06195 , year=

work page arXiv

[33] [33]

The twelfth international conference on learning representations , year=

Let's verify step by step , author=. The twelfth international conference on learning representations , year=

work page

[34] [34]

Jain, Naman and Han, King and Gu, Alex and Li, Wen-Ding and Yan, Fanjia and Zhang, Tianjun and Wang, Sida and Solar-Lezama, Armando and Sen, Koushik and Stoica, Ion , journal =

work page

[35] [35]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. arXiv preprint arXiv:2403.07974 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Advances in neural information processing systems , volume=

Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=

work page

[38] [38]

2025 , howpublished =

work page 2025

[39] [39]

2025 , howpublished =

Introducing. 2025 , howpublished =

work page 2025

[40] [40]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

work page 2025

[41] [41]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=

work page

[43] [43]

Owen , year = 2013, title =

Art B. Owen , year = 2013, title =

work page 2013

[44] [44]

Monte Carlo Statistical Method , volume =

Robert, Christian and Casella, George , year =. Monte Carlo Statistical Method , volume =. Technometrics , doi =

work page

[45] [45]

Wiley Interdisciplinary Reviews: Computational Statistics , volume=

Importance sampling: a review , author=. Wiley Interdisciplinary Reviews: Computational Statistics , volume=. 2010 , publisher=

work page 2010

[46] [46]

arXiv preprint arXiv:2102.05407 , year=

Advances in importance sampling , author=. arXiv preprint arXiv:2102.05407 , year=

work page arXiv

[47] [47]

Journal of the American Statistical Association , volume=

Safe and effective importance sampling , author=. Journal of the American Statistical Association , volume=. 2000 , publisher=

work page 2000

[48] [48]

Eligibility traces for off-policy policy evaluation , author=

work page

[49] [49]

Advances in neural information processing systems , volume=

Safe and efficient off-policy reinforcement learning , author=. Advances in neural information processing systems , volume=

work page

[50] [50]

arXiv preprint arXiv:2507.21053 , year=

Flow matching policy gradients , author=. arXiv preprint arXiv:2507.21053 , year=

work page arXiv

[51] [51]

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation , author=. arXiv preprint arXiv:2604.13010 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[52] [52]

arXiv preprint arXiv:2510.13651 , year=

What is the objective of reasoning with reinforcement learning? , author=. arXiv preprint arXiv:2510.13651 , year=

work page arXiv

[53] [53]

2026 , eprint=

LAD: Learning Advantage Distribution for Reasoning , author=. 2026 , eprint=

work page 2026

[54] [54]

2024 , url =

American Invitational Mathematics Examination (AIME) 2024 , author =. 2024 , url =

work page 2024

[55] [55]

2025 , url =

American Invitational Mathematics Examination (AIME) 2025 , author =. 2025 , url =

work page 2025

[56] [56]

Hugging Face repository , howpublished =

CodeForces , author=. Hugging Face repository , howpublished =. 2025 , publisher =

work page 2025

[57] [57]

DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level , author=

work page