QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

Anthony Zhan; Emily B. Fox; Kanishk Gandhi; Michael Y. Li; Noah D. Goodman

arxiv: 2607.01179 · v1 · pith:X2DABLZ6new · submitted 2026-07-01 · 💻 cs.LG · cs.CL


Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman, Emily B. Fox

Pith reviewed 2026-07-02 15:15 UTC · model grok-4.3


keywords: quasi-Monte Carlo, test-time scaling, correlated sampling, language models, inference efficiency, reinforcement learning, pass@k, sample efficiency


The pith

Quasi-Monte Carlo sampling matches independent sampling accuracy with 25-47 percent fewer samples on reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models waste inference compute when many parallel attempts are drawn independently and therefore overlap. QuasiMoTTo replaces those independent draws with correlated ones by reparameterizing autoregressive generation as inverse-CDF sampling and feeding the uniforms from a quasi-Monte Carlo sequence. Because the quasi-Monte Carlo points are spaced more evenly, the resulting outputs cover more of the possible answer space while each individual sample remains exactly distributed according to the model. The same batch can therefore be used for both accuracy measurement and policy-gradient training. Across four reasoning benchmarks the method matches i.i.d. pass@k accuracy with 25-47 percent fewer samples, and in GRPO training it matches i.i.d. performance with 50 percent fewer training steps.

Core claim

QuasiMoTTo reparameterizes autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo sequences. The resulting batch is correlated, yet each sample stays marginally distributed exactly according to the language model. This property lets the batch serve unchanged for pass@k evaluation and for policy-gradient updates. The lower redundancy produces higher coverage, which often saturates an upper bound on pass@k that any marginal-preserving sampler must obey. Empirically the approach matches i.i.d. pass@k accuracy with 25-47 percent fewer samples and matches i.i.d. GRPO performance with 50 percent fewer training steps.

What carries the argument

Inverse-CDF reparameterization of autoregressive sampling paired with quasi-Monte Carlo low-discrepancy uniforms.
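As a rough illustration of the inverse-CDF step (the Figure 3 caption further down describes the same construction), the Python sketch below maps one uniform number to one token. The function names `softmax` and `inverse_cdf_token` and the toy logits are hypothetical, and the sketch covers a single decoding step only; how the paper threads uniforms through a whole sequence is not shown in this excerpt.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_cdf_token(logits, u):
    """Return the token index whose bin on [0, 1) contains the uniform u."""
    probs = softmax(logits)
    # Sort token ids by descending probability (the permutation sigma).
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    cumulative = 0.0
    for token in order:
        cumulative += probs[token]  # right edge of this token's bin
        if u < cumulative:
            return token
    return order[-1]  # guard against floating-point round-off near u = 1

# Toy logits (assumed): probabilities are roughly 0.67, 0.24, 0.09.
for u in (0.0, 0.7, 0.95):
    print(u, inverse_cdf_token([2.0, 1.0, 0.0], u))
```

Because the bin widths equal the token probabilities, feeding a uniform `u` returns each token with exactly its model probability; only the way the uniforms are generated changes between i.i.d. and quasi-Monte Carlo sampling.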

If this is right

QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples across four reasoning benchmarks.
QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler.
QuasiMoTTo matches i.i.d. performance with 50% fewer training steps in GRPO.
Higher coverage from the correlated samples yields a stronger learning signal per batch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reparameterization could be applied to other parallel Monte Carlo tasks in generative modeling where sample overlap is costly.
Saturation of the pass@k bound implies that further efficiency gains would require samplers that change the marginal distribution itself.
The approach suggests that independence is not required for trivial parallel scaling once the marginals are preserved.

Load-bearing premise

Reparameterizing autoregressive sampling as inverse-CDF sampling lets quasi-Monte Carlo uniforms produce samples that remain exactly distributed according to the language model while adding useful correlation.

What would settle it

An experiment in which the empirical marginal distribution of QuasiMoTTo samples deviates from the language model or in which the bootstrap-estimated pass@k fails to exceed the i.i.d. curve for the same sample count.
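One way the marginal half of such a check could look, as a sketch under stated assumptions rather than the paper's protocol: draw a lattice with a fresh random shift many times, push one fixed batch position through an inverse-CDF lookup, and compare the token frequencies with the target probabilities. The toy distribution `PROBS`, the helper names, and the shifted lattice u_i = (∆ + i/k) mod 1 are illustrative assumptions.

```python
import random

PROBS = [0.5, 0.3, 0.2]  # toy next-token distribution (assumed, not from the paper)

def token_from_uniform(probs, u):
    """Inverse-CDF lookup: return the index of the bin that contains u."""
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against round-off

def shifted_lattice(k, shift):
    """k equally spaced points on [0, 1), all moved by the same shift."""
    return [(shift + i / k) % 1.0 for i in range(k)]

def marginal_frequencies(position, k, trials, rng, randomize=True):
    """Empirical token frequencies at one batch position across many batches."""
    counts = [0] * len(PROBS)
    for _ in range(trials):
        shift = rng.random() if randomize else 0.0
        u = shifted_lattice(k, shift)[position]
        counts[token_from_uniform(PROBS, u)] += 1
    return [c / trials for c in counts]

rng = random.Random(0)
with_shift = marginal_frequencies(position=2, k=4, trials=20000, rng=rng)
without_shift = marginal_frequencies(position=2, k=4, trials=100, rng=rng, randomize=False)
print(with_shift)     # should be close to PROBS
print(without_shift)  # all mass on one token: the unshifted point never moves
```

With the random shift the fixed position sees a uniform number, so its token frequencies track `PROBS`; without it the same position always lands on the same token, which is the point-mass failure the editorial notes below worry about.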

Figures

Figures reproduced from arXiv: 2607.01179 by Anthony Zhan, Emily B. Fox, Kanishk Gandhi, Michael Y. Li, Noah D. Goodman.

**Figure 1.** Dependent samples for inference compute scaling and RL. (a) Inference compute scaling and RL rely on independent (i.i.d.) sampling, which wastes compute generating redundant solutions. (b) QuasiMoTTo explores the design space of correlated samplers that cover the output space more evenly than i.i.d.; each individual sample remains an exact sample from the language model (LM). (c) Sampling is embarrassingly …

**Figure 2.** Same marginals, different joints. Two distributions over (x1, x2) with identical marginals (see histograms) but different dependence structure: independent (left) versus negatively correlated (right). Since we often only need to preserve marginals, we can exploit this flexibility to generate LM rollouts that are correlated in a way that better covers the output space.
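A minimal numerical sketch of the same idea, using an antithetic pair (U, 1 − U) as a stand-in for the negatively correlated panel. This construction is an illustrative assumption, not necessarily the one plotted in the figure, and `correlation` is a hypothetical helper.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation computed from scratch."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 10000
x1 = [rng.random() for _ in range(n)]
x2_independent = [rng.random() for _ in range(n)]  # fresh draw, unrelated to x1
x2_antithetic = [1.0 - x for x in x1]              # still uniform on (0, 1]

# Both versions of x2 have a mean near 0.5 (same marginal) ...
print(statistics.fmean(x2_independent), statistics.fmean(x2_antithetic))
# ... but very different dependence on x1.
print(correlation(x1, x2_independent), correlation(x1, x2_antithetic))
```

Any statistic that looks at one coordinate at a time cannot tell the two cases apart, while a statistic of the pair, such as whether at least one coordinate falls in a given region, can differ substantially.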

**Figure 3.** Exact sampling from the LM using inverse-CDF sampling. The LM produces logits over the answer choices. We sort them in descending order to yield a permutation σ and a sorted probability vector (panel 2). Stacking these probabilities partitions the unit interval into bins whose widths equal the token probabilities; a uniformly random point lands in each bin with probability equal to that token's mass. To sa…

**Figure 4.** Bootstrapped QMC pass@k estimator. Because the k samples are dependent rather than i.i.d. draws, the standard pass@k estimator is invalid; instead, we exploit that any stride-2 subsample is itself a valid pass@(k/2) lattice. Stride 2 admits two starting offsets, each yielding an unbiased pass@4 estimate from the same pass@8 rollout. This generalizes to larger strides.
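The stride idea in this caption can be sketched as follows. The function name `stride_pass_estimate` is hypothetical, the rollouts are assumed to be stored in lattice order, and averaging the per-offset estimates is an assumption of this sketch; the caption only states that each offset yields an unbiased estimate.

```python
def stride_pass_estimate(correct, stride):
    """Average the any-correct indicator over all stride-offset subsamples.

    `correct[i]` is True when the i-th rollout, in lattice order, solved the task.
    Returns an estimate of pass@(k / stride) from a single batch of k rollouts.
    """
    k = len(correct)
    if stride < 1 or k % stride != 0:
        raise ValueError("stride must be a positive divisor of the batch size")
    # Each offset selects every `stride`-th rollout, a coarser lattice of k/stride points.
    hits = [any(correct[offset::stride]) for offset in range(stride)]
    return sum(hits) / stride

# One pass@8 batch in which only the second rollout is correct (made-up data).
rollout = [False, True, False, False, False, False, False, False]
for stride in (1, 2, 4, 8):
    print(f"pass@{len(rollout) // stride} estimate:", stride_pass_estimate(rollout, stride))
```

For this batch the estimates are 1.0, 0.5, 0.25 and 0.125 for pass@8, pass@4, pass@2 and pass@1 respectively; in practice these per-problem values would be averaged over many problems and batches.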

**Figure 5.** The joint probability of two sequences is determined by the intersection of intervals whose …

**Figure 6.** i.i.d. sampling falls short of the per-problem pass@k upper bound. For a problem with pass@1 probability p, the union bound gives a ceiling of min(pk, 1) (black) that places an upper bound on any sampler with target marginals. The i.i.d. curve 1 − (1 − p)^k (coral) lies strictly below this ceiling; the shaded gap represents the redundancy of independent sampling. Carefully constructed dependent samples can clo…
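The two curves in this caption follow from elementary probability. The short derivation below is a reader's sketch, with A_i denoting the (assumed) event that the i-th of k samples is correct and Pr(A_i) = p for every i.

```latex
\begin{align*}
\text{any sampler with the target marginals:}\quad
  \Pr\Big(\bigcup_{i=1}^{k} A_i\Big)
    &\le \min\Big(\sum_{i=1}^{k} \Pr(A_i),\; 1\Big) = \min(pk,\, 1), \\
\text{i.i.d. sampling:}\quad
  \Pr\Big(\bigcup_{i=1}^{k} A_i\Big)
    &= 1 - \prod_{i=1}^{k} \big(1 - \Pr(A_i)\big) = 1 - (1 - p)^k .
\end{align*}
```

For example, with p = 0.1 and k = 5 the ceiling is 0.5, while the i.i.d. value is 1 − 0.9^5 ≈ 0.41; the difference is the redundancy that a dependent sampler can try to recover.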

**Figure 7.** QuasiMoTTo pass@k analysis. QuasiMoTTo (lattice) consistently dominates i.i.d. sampling across all four reasoning benchmarks. Strikingly, QuasiMoTTo closely tracks the pass@k upper bound that no training-free sampler can exceed. Note, we plot error bars ±1 SD but they are small enough that they are not visible in the plot above. Importantly, the upper bound curve aggregates over multiple problems, which is…

**Figure 8.** QuasiMoTTo boosts sample efficiency. QuasiMoTTo can use 25%-47% fewer samples than i.i.d. sampling while achieving the same accuracy. These performance gains hold across different QMC methods, although lattice performs best.

[Plot: eval pass@1 vs. training step; x-axis: checkpoint step; y-axis: pass@1; series: i.i.d., QuasiMoTTo (Lattice), QuasiMoTTo (Sobol), QuasiMoTTo (Stratified).]

**Figure 9.** QuasiMoTTo compute efficiency for RL. We plot the pass@1 on the evaluation set against the training step. QuasiMoTTo achieves the same pass@1 in fewer steps compared to i.i.d. sampling.

… within a group, it can boost the effective sample size per gradient step and decrease the number of training steps required to achieve a target performance. Setup. We train Qwen3.5-0.8B-Base on Maze and Sudoku using GRPO [11] to…

**Figure 10.** QuasiMoTTo RL training dynamics. (left) QuasiMoTTo produces consistently fewer zero-variance groups (i.e., groups in which G rollouts all succeed or all fail and contribute no learning signal) throughout training because it samples responses with higher coverage. (right) This larger effective sample size causes QuasiMoTTo's training reward to rise faster.

… sampling [35]. Note that, because we use a group b…
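To make the zero-variance notion concrete, here is a small sketch assuming binary rewards and the usual group-relative baseline, in which each rollout's advantage is its reward minus the group mean. The function names and the example rewards are hypothetical, and any standard-deviation normalization is left out.

```python
def zero_variance_fraction(groups):
    """Fraction of groups whose rewards are all identical."""
    flat = [len(set(rewards)) == 1 for rewards in groups]
    return sum(flat) / len(groups)

def group_advantages(rewards):
    """Mean-centred rewards; dividing by the group std is omitted for brevity."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four groups of G = 4 rollouts with 0/1 rewards (made-up data).
groups = [
    [1, 1, 1, 1],  # all succeed: every advantage is zero
    [0, 0, 0, 0],  # all fail: every advantage is zero
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
print(zero_variance_fraction(groups))  # 0.5
print(group_advantages(groups[0]))     # [0.0, 0.0, 0.0, 0.0]
print(group_advantages(groups[3]))     # [-0.25, 0.75, -0.25, -0.25]
```

A group whose advantages are all zero contributes nothing to the policy gradient, so a sampler that produces fewer such groups extracts more signal from the same number of rollouts.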

**Figure 11.** A visual overview of the QMC samplers. (a) Independent, stratified, and lattice sampling are 1-D samplers that trade freedom for coverage: from top to bottom, freedom falls and coverage rises, tracked by the pairwise mutual information I(U_i; U_j). (b) Token-level Sobol folds the sequence dimension n into the QMC space. In this n = 2 toy, the axes are the first and second tokens (with a conditional second-…
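The three 1-D samplers named in panel (a) can be sketched in a few lines. These are textbook constructions written with hypothetical function names; the paper's exact variants are not given in this excerpt, and token-level Sobol is left out.

```python
import random

def independent_uniforms(k, rng):
    """k independent draws: full freedom, no coverage guarantee."""
    return [rng.random() for _ in range(k)]

def stratified_uniforms(k, rng):
    """One independently jittered point per stratum [i/k, (i+1)/k)."""
    points = [(i + rng.random()) / k for i in range(k)]
    rng.shuffle(points)  # random stratum-to-slot assignment, so each slot is marginally uniform
    return points

def lattice_uniforms(k, rng):
    """Equally spaced points sharing a single random shift."""
    shift = rng.random()
    return [(shift + i / k) % 1.0 for i in range(k)]

def occupied_strata(points):
    """Number of distinct strata [i/k, (i+1)/k) hit by the points."""
    k = len(points)
    return len({int(u * k) for u in points})

rng = random.Random(0)
k = 8
for name, sampler in [("independent", independent_uniforms),
                      ("stratified", stratified_uniforms),
                      ("lattice", lattice_uniforms)]:
    print(name, occupied_strata(sampler(k, rng)), "of", k, "strata occupied")
```

Stratified and lattice points occupy every stratum by construction, whereas independent draws typically leave some empty and double up in others. The lattice has a single random degree of freedom (the shift), which matches the caption's description of freedom falling as coverage rises.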

**Figure 12.** Representative examples from each of the four tasks.

**Figure 13.** QuasiMoTTo pass@k analysis for all samplers.

**Figure 14.** QuasiMoTTo compute efficiency for RL. We plot the pass@1 on the evaluation set against the training step. QuasiMoTTo achieves the same pass@1 in fewer steps compared to i.i.d. sampling. Error bars are computed as follows: we estimate the accuracy p_i for each of n questions and report std(p)/√n.

A.3 Theoretical analyses. Theorem 1 (Correctness of dyadic bootstrap sampling). Let k = 2^L, and let u_i = ∆ + i/k …

**Figure 15.** The pairwise probability q_ij = λ_T(K_i ∩ K_j) decomposes into four terms by Proposition 2. In each row, we visualize the intersections, showing both the intersection on the circle (left) and the corresponding intersections on the unit interval (arbitrarily) cut at 0 (right). Rows 1–2 show the two pieces of the overlap adjacent to the cut at 0, with lengths min(r_i, r_j) and min(ℓ_i, ℓ_j). Rows 3–4 show the …

Original abstract

Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

QuasiMoTTo shows QMC on inverse-CDF uniforms can cut samples needed for pass@k by 25-47% and training steps in GRPO by 50%, with a new bootstrap estimator, but the exact marginal preservation under deterministic QMC needs verification.


QuasiMoTTo's core move is to reparameterize autoregressive LM sampling as inverse-CDF draws and replace i.i.d. uniforms with QMC sequences. This produces correlated batches that still claim to be marginally correct, which lets them plug directly into pass@k and policy gradients.

The new piece is applying QMC this way to test-time scaling and GRPO, plus the unbiased bootstrap they built to measure pass@k when samples are dependent. The reported gains are concrete: 25-47% fewer samples to match i.i.d. accuracy on four reasoning benchmarks, and often hitting the theoretical upper bound that any marginal-preserving sampler can reach. The GRPO result with half the steps is also straightforward to check.

The main soft spot is the marginal claim. The abstract states each sample remains distributed exactly according to the LM, but standard deterministic QMC sequences are fixed points. Without an explicit randomization layer (scrambling or random shift) the induced distribution on token sequences is a point mass, not the target LM. The bootstrap corrects for dependence but not for marginal bias. If the paper does not include that randomization step, the efficiency numbers and the saturation claim become hard to compare to i.i.d. baselines. I'd want the methods section to show the exact construction and any verification that marginals are preserved.

The work is aimed at people already running large-scale parallel sampling or RL post-training on language models. Anyone measuring sample efficiency or trying to reduce redundancy in inference will find the estimator and the empirical pattern useful.

It is worth sending to referees. The technique is new enough and the empirical pattern is clear enough that a careful review can settle the marginal question and check reproducibility.

Referee Report

2 major / 2 minor

Summary. The paper introduces QuasiMoTTo, which reparameterizes autoregressive LM sampling as inverse-CDF sampling and replaces i.i.d. uniforms with QMC uniforms to generate correlated samples that remain marginally distributed according to the LM. This is positioned as a drop-in replacement for i.i.d. sampling to reduce redundancy in test-time scaling for reasoning and in GRPO policy gradients. An unbiased bootstrap estimator is developed to handle dependence when computing pass@k. Empirical results claim that QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples across four benchmarks (often saturating a marginal-preserving upper bound) and matches i.i.d. GRPO performance with 50% fewer training steps.

Significance. If the marginal-preservation claim holds exactly and the bootstrap estimator is unbiased, the work demonstrates a concrete way to improve coverage and sample efficiency in parallel inference without sacrificing correctness or introducing bias in RL gradients. The bootstrap estimator itself is a useful methodological tool for dependent samplers. The saturation of the upper bound and the GRPO gains are notable if reproducible, but the overall significance hinges on verification that the QMC construction does not perturb the marginals.

major comments (2)

[Abstract and reparameterization methods] The central claim that 'each sample is marginally distributed according to the language model' after the inverse-CDF + QMC reparameterization is load-bearing for all pass@k and GRPO results. The abstract provides no indication that randomized QMC (random shift, scrambling, or Owen scrambling) is applied; deterministic QMC sequences induce a point-mass distribution on token sequences rather than the target LM marginals. This must be clarified with the precise construction used (e.g., which QMC sequence and randomization layer) in the methods section, as the bootstrap estimator only corrects for dependence and does not restore marginal correctness.
[Bootstrap estimator development] The unbiasedness of the bootstrap estimator for pass@k under QMC-induced dependence is asserted but not derived in the provided abstract. The estimator must be shown to remain unbiased specifically for the correlation structure produced by the QMC uniforms (not generic dependence), otherwise the 25-47% efficiency numbers cannot be directly compared to i.i.d. baselines.

minor comments (2)

[Abstract] The four reasoning benchmarks are not named in the abstract; listing them (e.g., GSM8K, MATH, etc.) would improve immediate readability.
[Methods] Notation for the inverse-CDF transform and the QMC sequence should be introduced with explicit equations early in the methods to make the reparameterization reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments highlight important points on marginal preservation and estimator unbiasedness that we address below. We will revise the manuscript to improve clarity and add the requested details.


Referee: [Abstract and reparameterization methods] The central claim that 'each sample is marginally distributed according to the language model' after the inverse-CDF + QMC reparameterization is load-bearing for all pass@k and GRPO results. The abstract provides no indication that randomized QMC (random shift, scrambling, or Owen scrambling) is applied; deterministic QMC sequences induce a point-mass distribution on token sequences rather than the target LM marginals. This must be clarified with the precise construction used (e.g., which QMC sequence and randomization layer) in the methods section, as the bootstrap estimator only corrects for dependence and does not restore marginal correctness.

Authors: We agree that explicit clarification is needed. The full manuscript implements randomized QMC via a randomly shifted Sobol sequence (base-2, dimension equal to the maximum sequence length), which ensures each individual uniform is exactly uniform on [0,1] and thus each autoregressive sample is marginally distributed according to the LM. The random shift is drawn once per batch and provides the required marginal correctness while inducing the negative dependence that improves coverage. We will revise the abstract to mention 'randomized quasi-Monte Carlo' and expand the methods section with the precise construction, including the randomization layer and a citation to the RQMC literature. This directly addresses the concern that marginals must be verified independently of the bootstrap. revision: yes
Referee: [Bootstrap estimator development] The unbiasedness of the bootstrap estimator for pass@k under QMC-induced dependence is asserted but not derived in the provided abstract. The estimator must be shown to remain unbiased specifically for the correlation structure produced by the QMC uniforms (not generic dependence), otherwise the 25-47% efficiency numbers cannot be directly compared to i.i.d. baselines.

Authors: The bootstrap estimator is constructed to be unbiased under the specific negative dependence induced by the randomized QMC uniforms. Because marginal correctness is preserved, the pass@k functional has the same expectation as under i.i.d. sampling; the bootstrap then resamples blocks that respect the observed joint structure while remaining unbiased for that expectation. We will add a self-contained derivation in the appendix that explicitly uses the RQMC correlation properties (negative association of the shifted points) rather than generic dependence, confirming that the 25-47% efficiency gains are valid comparisons to i.i.d. baselines. The current empirical numbers already employ this estimator. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces QuasiMoTTo via a reparameterization of autoregressive LM sampling as inverse-CDF sampling driven by QMC uniforms, asserts marginal preservation by the standard properties of that transform, develops a bootstrap estimator to handle dependence for pass@k evaluation, and reports empirical gains on benchmarks plus GRPO. No load-bearing step reduces by the paper's own equations or self-citations to a fitted input, self-defined quantity, or prior author result; the upper bound is stated as holding for any marginal-preserving sampler and the performance numbers are measured against external i.i.d. baselines. The derivation is therefore self-contained against external benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, ad-hoc axioms, or invented entities; the method rests on standard QMC uniformity properties and the correctness of the inverse-CDF reparameterization, both treated as background.



Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · 5 internal anchors

[1] Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, and Shaul Druckmann. Rethinking fine-tuning when scaling test-time compute: Limiting confidence improves mathematical reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[2] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... arXiv, 2021.

[3] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Wiley-Interscience, USA, 2 edition, 2006. ISBN 0471241954.

[4] Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[5] DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025.

[6] Alek Dimitriev and Mingyuan Zhou. CARMS: Categorical-antithetic-REINFORCE multi-sample gradient estimator. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.

[7] Kanishk Gandhi, Denise H J Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah Goodman. Stream of search (sos): Learning to search in language. In First Conference on Language Modeling, 2024.

[8] Kanishk Gandhi, Ayush K Chakravarthy, Anikait Singh, Nathan Lile, and Noah Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective STars. In Second Conference on Language Modeling, 2025.

[9] Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, and Noah D. Goodman. Boxinggym: Benchmarking progress in automated experimental design and model discovery, 2025.

[10] Agastya Goel and Linden Li. Speeding up rl with high-leverage samples. https://www.appliedcompute.com/research/speeding-up-rl-with-high-leverage-samples, 2026.

[11] Daya Guo et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, pages 633–638, 2025.

[12] Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh, Yoonho Lee, Dorsa Sadigh, Chelsea Finn, and Noah Goodman. SPIRAL: Learning to search and aggregate, 2026. URL https://arxiv.org/abs/2606.23595.

[13] Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, and Dorsa Sadigh. Polychromic objectives for reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026.

[14] Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang, Chelsea Finn, Emma Brunskill, and Noah D. Goodman. Giants: Generative insight anticipation from scientific literature, 2026.

[15] Wouter Kool, Herke Van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Proceedings of the 36th International Conference on Machine Learning, 2019.

[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, 2023.

[17] Michael Y. Li, Emily B. Fox, and Noah D. Goodman. Automated statistical model discovery with language models, 2024.

[18] Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox, and Noah D. Goodman. Neural garbage collection: Learning to forget while learning to reason, 2026.

[19] David J. C. MacKay. Information theory, inference & learning algorithms, 2002.

[20] Arvind Mahankali, Kaiyue Wen, and Tengyu Ma. Divide-and-conquer cot: Rl for reducing latency via parallel reasoning, 2026.

[21] Yuzhen Mao, Michael Y. Li, and Emily B. Fox. Simplified sparse attention via gist tokens, 2026.

[22] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algo... arXiv, 2025.

[23] OpenAI. Openai o1 system card, 2026.

[24] Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, and Chelsea Finn. Poly-epo: Training exploratory reasoning models, 2026.

[25] Art B. Owen. Monte Carlo Theory, Methods and Examples. Art Owen, 2013.

[26] Jiayi Pan, Xiuyu Li, Long Lian, Charlie Victor Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models. In Second Conference on Language Modeling, 2025.

[27] Aditya Parashar, Aditya Vikram Singh, Avinash Amballa, Jinlin Lai, and Benjamin Rozonoyer. Quasi-random multi-sample inference for large language models, 2025.

[28] I. M. Sobol'. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(67)90144-9. URL https://www.sciencedirect.com/science/article/pii/0041555367901449.

[29] Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning, 2026.

[30] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424, 2016.

[31] Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Passos, and Sumit Sanghai. Arithmetic sampling: parallel diverse decoding for large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

[32] Larry Wasserman. All of statistics: a concise course in statistical inference. New York, 2010.

[33] Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and Elias B. Khalil. Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations, 2024.

[34] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv, 2025.

[35] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025.


2026

[11] [11]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, page 633–638, 2025

2025

[12] [12]

SPIRAL: Learning to Search and Aggregate

Jubayer Ibn Hamid, Ifdita Hasan Orney, Michael Y. Li, Omar Shaikh, Yoonho Lee, Dorsa Sadigh, Chelsea Finn, and Noah Goodman. Spiral: Learning to search and aggregate, 2026. URL https://arxiv.org/abs/2606.23595. 17

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [13]

Polychromic objectives for reinforcement learning

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, and Dorsa Sadigh. Polychromic objectives for reinforcement learning. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[14] [14]

Li, Sherry Yang, Chelsea Finn, Emma Brunskill, and Noah D

Joy He-Yueya, Anikait Singh, Ge Gao, Michael Y. Li, Sherry Yang, Chelsea Finn, Emma Brunskill, and Noah D. Goodman. Giants: Generative insight anticipation from scientific literature, 2026

2026

[15] [15]

Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement

Wouter Kool, Herke Van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. InProceedings of the 36th International Conference on Machine Learning, 2019

2019

[16] [16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, 2023

2023

[17] [17]

Li, Emily B

Michael Y. Li, Emily B. Fox, and Noah D. Goodman. Automated statistical model discovery with language models, 2024

2024

[18] [18]

Li, Jubayer Ibn Hamid, Emily B

Michael Y. Li, Jubayer Ibn Hamid, Emily B. Fox, and Noah D. Goodman. Neural garbage collection: Learning to forget while learning to reason, 2026

2026

[19] [19]

David J. C. MacKay. Information theory, inference & learning algorithms, 2002

2002

[20] [20]

Divide-and-conquer cot: Rl for reducing latency via parallel reasoning, 2026

Arvind Mahankali, Kaiyue Wen, and Tengyu Ma. Divide-and-conquer cot: Rl for reducing latency via parallel reasoning, 2026

2026

[21] [21]

Li, and Emily B

Yuzhen Mao, Michael Y. Li, and Emily B. Fox. Simplified sparse attention via gist tokens, 2026

2026

[22] [22]

Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algo...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Openai o1 system card, 2026

OpenAI. Openai o1 system card, 2026

2026

[24] [24]

Poly-epo: Training exploratory reasoning models, 2026

Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, and Chelsea Finn. Poly-epo: Training exploratory reasoning models, 2026

2026

[25] [25]

Owen.Monte Carlo Theory, Methods and Examples

Art B. Owen.Monte Carlo Theory, Methods and Examples. Art Owen, 2013

2013

[26] [26]

Learning adaptive parallel reasoning with language models

Jiayi Pan, Xiuyu Li, Long Lian, Charlie Victor Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models. In Second Conference on Language Modeling, 2025

2025

[27] [27]

Quasi-random multi-sample inference for large language models, 2025

Aditya Parashar, Aditya Vikram Singh, Avinash Amballa, Jinlin Lai, and Benjamin Rozonoyer. Quasi-random multi-sample inference for large language models, 2025

2025

[28] [28]

On the distribution of points in a cube and the approximate evaluation of integrals

I.M Sobol’. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(67)90144-9. URL https://www.sciencedirect. com/science/article/pii/0041555367901449. 18

work page doi:10.1016/0041-5553(67)90144-9 1967

[29] [29]

Maximum likelihood reinforcement learning, 2026

Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning, 2026

2026

[30] [30]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. Diverse beam search: Decoding diverse solutions from neural sequence models.CoRR, abs/1610.02424, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[31] [31]

Arith- metic sampling: parallel diverse decoding for large language models

Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Passos, and Sumit Sanghai. Arith- metic sampling: parallel diverse decoding for large language models. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

2023

[32] [32]

New York, 2010

Larry Wasserman.All of statistics : a concise course in statistical inference. New York, 2010

2010

[33] [33]

Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and Elias B. Khalil. Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations, 2024

2024

[34] [34]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Group sequence policy optimization, 2025

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. 19 A Technical appendices and supplementary material A.1 Experiment Details Maze Countdown ********* *E..*...* *.***.*** *.......* ***.*.*** *...*...* *S*.***** *.*.......

2025