DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
Pith reviewed 2026-05-20 15:06 UTC · model grok-4.3
The pith
DISA estimates the prompt-dependent partition function offline with importance sampling on proposal trajectories and freezes the estimate before policy optimization begins in distribution-matching LLM-RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins, preserving the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics.
What carries the argument
The frozen importance-sampled estimate of the prompt-dependent partition function, which decouples calibration from the subsequent policy updates.
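The offline estimation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `(reward, logp_ref, logp_proposal)` tuple layout, and the use of a reference-policy term in the weight (taken from the trajectory-balance loss quoted further down this page) are all assumptions made for the example.

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    if m == float("-inf"):
        return float("-inf")
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))

def estimate_log_z(samples, beta):
    """Importance-sampling estimate of log Z for ONE prompt.

    samples: list of (reward, logp_ref, logp_proposal) tuples, one per
    offline proposal trajectory (hypothetical layout).
    beta: inverse temperature.
    """
    # Log importance weight: beta * r + log pi_ref - log proposal.
    log_w = [beta * r + lp_ref - lp_prop for r, lp_ref, lp_prop in samples]
    return log_mean_exp(log_w)

# Toy data: proposal equals reference, so weights reduce to exp(beta * r).
toy = [(1.0, -2.0, -2.0), (0.0, -3.0, -3.0)]
frozen_log_z = estimate_log_z(toy, beta=1.0)  # computed once, then held fixed
```

Working in log space avoids overflow when `beta * r` is large, and the returned value would be stored per prompt and never updated during policy optimization.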
If this is right
- On two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL.
- DISA outperforms reward-maximization baselines GRPO and GSPO on math averages.
- DISA exceeds LoRASFT distillation by up to 13.8 Mean@8 points when using the same offline trajectories.
- An LLM-as-judge evaluation shows DISA retains substantially more strategy-level diversity than reward-maximization baselines.
- Sensitivity studies on proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
Where Pith is reading between the lines
- The offline nature of the estimation step could reduce training-time compute by avoiding repeated partition-function calculations during policy updates.
- Separate diagnostics for the frozen estimate might make it easier to diagnose whether poor final performance stems from bad calibration or from the optimization process itself.
- Similar offline anchoring could be tested in other sequential generation settings that require matching a target distribution over multiple valid outputs.
- The bias-variance tradeoff observed in the sensitivity studies suggests that proposal quality remains a key lever even after the estimate is frozen.
Load-bearing premise
Offline trajectories from a fixed proposal distribution are representative enough to yield an accurate low-variance importance-sampling estimate of the partition function that can be frozen without distorting later policy updates.
What would settle it
A direct comparison showing that policy optimization with the frozen offline estimate produces measurably different solution distributions or lower performance than an otherwise identical run that estimates the partition function online would indicate the decoupling fails to preserve the objective.
Original abstract
Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRASFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DISA (Decoupled Importance-Sampled Anchoring) for distribution-matching RL in LLMs. It draws proposal trajectories offline from a fixed distribution, estimates the prompt-dependent partition function Z via importance sampling, freezes the resulting estimate, and then performs separate policy optimization. The central claim is that this decoupling preserves the exact distribution-matching objective while cleanly separating estimation from learning in data, gradients, loss, and diagnostics. Experiments across math and code benchmarks on open-weight models report performance matching or exceeding the online-coupled baseline FlowRL, outperforming reward-maximization methods like GRPO and GSPO on math averages, and retaining higher strategy-level diversity.
Significance. If the offline IS estimates of Z prove sufficiently low-variance and unbiased, the decoupling would enable independent calibration and diagnostics for distribution-matching objectives, mitigating the mode-collapse issues of standard reward-max RL while reducing online computational overhead. The reported diversity gains and sensitivity studies aligned with bias-variance predictions would strengthen the case for practical adoption in reasoning agents.
major comments (2)
- [§3.2, Eq. (7)] The claim that freezing the IS estimate of the prompt-dependent partition function Z preserves the exact distribution-matching objective is load-bearing for the decoupling result, yet no variance bound or concentration analysis is given for the estimator weights exp(r(τ))/q(τ) when q is a base model or SFT checkpoint that assigns near-zero mass to rare high-reward trajectories.
- [Table 2 and §5.3] The reported Mean@8 gains over LoRASFT and FlowRL lack error bars, number of seeds, or exact prompt sampling protocols, so it is impossible to determine whether the observed improvements are statistically distinguishable from the variance that would arise from an inaccurate frozen Z.
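The quantities at stake in the first comment can be written out explicitly. The notation below is an assumption made for illustration (prompt $x$, response $o$, proposal $\mu$, reference policy $\pi_{\mathrm{ref}}$, inverse temperature $\beta$, reward $\tilde r$), chosen to agree with the trajectory-balance loss quoted elsewhere on this page; the referee's shorthand $\exp(r(\tau))/q(\tau)$ denotes the same kind of ratio. The identities are standard importance-sampling facts, not results taken from the paper.

```latex
\begin{align}
Z(x) &= \sum_{o} \pi_{\mathrm{ref}}(o \mid x)\, \exp\!\big(\beta\, \tilde r(x, o)\big), \\
w_i &= \frac{\pi_{\mathrm{ref}}(o_i \mid x)\, \exp\!\big(\beta\, \tilde r(x, o_i)\big)}{\mu(o_i \mid x)},
  \qquad o_i \sim \mu(\cdot \mid x), \\
\hat Z_N(x) &= \frac{1}{N} \sum_{i=1}^{N} w_i, \\
\mathbb{E}_{\mu}\big[\hat Z_N(x)\big] &= Z(x)
  \quad \text{if } \mu(o \mid x) > 0 \text{ wherever } \pi_{\mathrm{ref}}(o \mid x) > 0, \\
\operatorname{Var}\big[\hat Z_N(x)\big] &= \frac{1}{N}\Big(\mathbb{E}_{\mu}\big[w^2\big] - Z(x)^2\Big), \\
\mathbb{E}\big[\log \hat Z_N(x)\big] &\le \log Z(x) \quad \text{(Jensen's inequality)}.
\end{align}
```

The variance term grows when $\mu$ places little mass on high-reward responses, which is the regime the referee singles out. The last line is worth noting because a loss written in terms of $\log Z$ inherits a downward bias at finite $N$ even though $\hat Z_N$ itself is unbiased.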
minor comments (2)
- [§2] The notation for the proposal strength and inverse temperature parameters is introduced without an explicit table of symbols, making cross-references between the analysis and experiments harder to follow.
- [Figure 3] Figure 3 caption should state the exact number of offline proposal trajectories used per prompt so that readers can assess the IS sample size directly.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.
- Referee: [§3.2, Eq. (7)] The claim that freezing the IS estimate of the prompt-dependent partition function Z preserves the exact distribution-matching objective is load-bearing for the decoupling result, yet no variance bound or concentration analysis is given for the estimator weights exp(r(τ))/q(τ) when q is a base model or SFT checkpoint that assigns near-zero mass to rare high-reward trajectories.
Authors: We thank the referee for identifying this key theoretical point. The importance-sampling estimator for the prompt-dependent partition function Z is unbiased by construction, since its expectation under trajectories drawn from the proposal q recovers the true Z exactly. Consequently, the distribution-matching objective is preserved in expectation when the estimate is frozen. We acknowledge, however, that the manuscript provides no explicit variance bound or concentration result for the weights exp(r(τ))/q(τ), particularly when q is a base model or SFT checkpoint with limited mass on rare high-reward trajectories. This is a fair observation. In the revised manuscript we will add a short discussion in §3.2 clarifying that the objective is preserved in expectation, together with a qualitative analysis of the bias-variance tradeoff and a simple variance bound under the mild assumption that the proposal has positive (though possibly small) coverage of the relevant support. We will also report the effective sample size of the importance weights for each benchmark to give readers a practical sense of estimator stability. This is a partial revision: we strengthen the presentation and add supporting analysis without claiming a new tight concentration inequality.
Revision: partial
- Referee: [Table 2 and §5.3] The reported Mean@8 gains over LoRASFT and FlowRL lack error bars, number of seeds, or exact prompt sampling protocols, so it is impossible to determine whether the observed improvements are statistically distinguishable from the variance that would arise from an inaccurate frozen Z.
Authors: We agree that the absence of error bars, seed counts, and precise sampling protocols limits the ability to assess statistical distinguishability of the reported gains from variance attributable to the frozen Z estimate. This was an oversight in the original submission. In the revised version we will (i) recompute and report all Mean@8 numbers in Table 2 with standard-error bars obtained from five independent random seeds, (ii) add an explicit description in §5.3 of the prompt-sampling protocol (including how prompts were drawn from each benchmark and the exact number of prompts per task), and (iii) include a brief sensitivity paragraph discussing how moderate perturbations to the frozen Z affect downstream performance. These additions will allow readers to evaluate whether the observed improvements exceed the combined variance from both the Z estimator and training stochasticity.
Revision: yes
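The effective sample size promised in the first response is commonly computed with Kish's formula, (Σw)² / Σw². The sketch below assumes that definition and a hypothetical list of per-trajectory log-weights; the rebuttal does not say which ESS variant the authors intend to report.

```python
import math

def effective_sample_size(log_weights):
    """Kish effective sample size, (sum w)^2 / sum w^2, from log-weights.

    Subtracting the max log-weight first keeps exp() from overflowing;
    the ratio is unchanged because the scale factor cancels.
    """
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)

# Equal weights: every sample counts fully.
uniform = effective_sample_size([0.0, 0.0, 0.0, 0.0])
# One dominant weight: the estimate effectively rests on a single sample.
skewed = effective_sample_size([0.0, -20.0, -20.0, -20.0])
```

An ESS close to the number of proposal trajectories suggests a stable estimate for that prompt, while an ESS near 1 indicates that a single trajectory dominates the frozen value.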
Circularity Check
No significant circularity; derivation self-contained via methodological decoupling
The paper presents DISA as a decoupling of offline importance sampling for estimating the prompt-dependent partition function Z from subsequent policy optimization, with the estimate frozen before RL begins. No equations or steps in the provided abstract or description reduce the central claim to a self-definition, fitted input renamed as prediction, or load-bearing self-citation. The distribution-matching objective is preserved by construction of the separation in data/gradients/loss, and empirical results on benchmarks provide independent content. This is the common case of an honest non-finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- proposal strength
- inverse temperature
axioms (1)
- domain assumption: Importance sampling from offline proposal trajectories can produce a usable estimate of the prompt-dependent partition function.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: the trajectory-balance (TB) loss L_TB(θ, ϕ) = E[(log Z_ϕ(q) + log π_θ(o|q) − log π_ref(o|q) − β r̃(q,o))²]
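For readers unfamiliar with the trajectory-balance loss in the passage above, here is a minimal sketch of its squared residual with the learned log Z_ϕ replaced by a fixed per-prompt value. Substituting a frozen estimate this way is an assumption based on the page's summary of DISA; the function names and batch layout are hypothetical, and a real implementation would operate on framework tensors so that gradients flow through the policy log-probability only.

```python
def tb_residual(log_z, logp_policy, logp_ref, reward, beta):
    """Trajectory-balance residual: log Z + log pi - log pi_ref - beta * r."""
    return log_z + logp_policy - logp_ref - beta * reward

def tb_loss(batch, log_z_by_prompt, beta):
    """Mean squared TB residual over a batch.

    batch: list of (prompt_id, logp_policy, logp_ref, reward) tuples.
    log_z_by_prompt: dict of frozen log Z estimates, treated as constants.
    """
    total = 0.0
    for prompt_id, logp_policy, logp_ref, reward in batch:
        r = tb_residual(log_z_by_prompt[prompt_id], logp_policy, logp_ref, reward, beta)
        total += r * r
    return total / len(batch)

frozen = {"p0": 0.5}
batch = [
    ("p0", -1.5, -2.0, 1.0),  # policy matches pi_ref * exp(beta * r) / Z: residual 0
    ("p0", -0.5, -2.0, 1.0),  # policy log-prob too high by 1: residual 1
]
loss = tb_loss(batch, frozen, beta=1.0)
```

The residual vanishes exactly when the policy equals π_ref · exp(β r̃) / Z, so an error in the frozen log Z shifts the target for every response to that prompt by the same amount.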
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.