$S^3$: Stratified Scaling Search for Test-Time in Diffusion Language Models

Ahsan Bilal; Asad Aali; Dean F. Hougen; Emily Fox; Muhammad Ahmed Mohsin; Muhammad Umer; Muhammad Usman Khanzada; Muhammad Usman Rafique; Zihao He

arXiv: 2604.06260 · v1 · submitted 2026-04-07 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-10 20:19 UTC · model grok-4.3

keywords diffusion language models · test-time scaling · denoising trajectories · search methods · mathematical reasoning · verifier-guided sampling · inference compute

The pith

Stratified Scaling Search reallocates compute during denoising in diffusion language models to improve output quality with extra test-time inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion language models can achieve better results by directing additional inference compute toward promising intermediate states in the denoising process rather than sampling only final outputs. Naive best-of-K approaches repeatedly draw from the same base distribution and therefore miss higher-quality regions that the model can reach but does not favor under standard sampling. S³ expands multiple trajectories at each denoising step, scores them with a lightweight verifier that requires no reference answers, and resamples the stronger candidates while maintaining diversity. This procedure produces consistent gains on reasoning and question-answering benchmarks, with the clearest improvements on mathematical tasks, all without altering the underlying model or its fixed decoding schedule. A sympathetic reader would care because the method demonstrates a practical way to extract more performance from an already-trained diffusion language model simply by changing how compute is spent during generation.

Core claim

S³ approximates a reward-tilted sampling distribution by expanding multiple candidate trajectories at each denoising step, evaluating them with a lightweight reference-free verifier, selectively resampling promising candidates, and preserving diversity within the search frontier. The procedure reallocates compute across the denoising process instead of applying it only to final outputs, thereby improving generation quality for a fixed diffusion language model while remaining anchored to the model prior.
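For orientation, one common way to write a reward-tilted distribution is sketched below. This is an illustration under stated assumptions rather than the paper's own equation: $p_0$ (the base model distribution) and $\lambda$ appear in the figure captions, while the verifier score $r(x)$, the normalizer $Z_\lambda$, and the exponential form itself are assumptions introduced here.

```latex
p_\lambda(x) \;=\; \frac{1}{Z_\lambda}\, p_0(x)\, \exp\!\big(\lambda\, r(x)\big),
\qquad
Z_\lambda \;=\; \sum_{x'} p_0(x')\, \exp\!\big(\lambda\, r(x')\big)
```

Under this form, $\lambda = 0$ recovers $p_0$ exactly, and larger $\lambda$ shifts mass toward higher-scoring outputs, while the factor $p_0(x)$ keeps samples anchored to the model prior.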

What carries the argument

Stratified Scaling Search (S³), the verifier-guided procedure that expands candidate trajectories at each denoising step, scores them, and resamples promising paths to tilt the effective distribution toward higher-quality outputs.
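The expand, score, and resample loop described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_denoise_step` and `toy_verifier` are hypothetical stand-ins for the diffusion model and the reference-free verifier, the verifier scores the partial state directly rather than a one-step clean prediction, and the resampling shown is plain multinomial resampling with replacement, swapped in because the paper's own resampling scheme is not specified here.

```python
import math
import random

MASK = None  # placeholder for a position that has not been denoised yet


def toy_denoise_step(state, rng):
    """Hypothetical stand-in for one denoising step: unmask one position."""
    masked = [i for i, tok in enumerate(state) if tok is MASK]
    new_state = list(state)  # copy so sibling candidates stay independent
    if masked:
        new_state[rng.choice(masked)] = rng.choice([0, 1])
    return new_state


def toy_verifier(state):
    """Hypothetical reference-free score: fraction of positions equal to 1."""
    return sum(1 for tok in state if tok == 1) / len(state)


def s3_like_search(length=8, n_particles=4, branch=2, lam=1.0, seed=0):
    """Toy particle search over denoising trajectories (assumed structure)."""
    rng = random.Random(seed)
    particles = [[MASK] * length for _ in range(n_particles)]
    for _ in range(length):  # one position is unmasked per step
        # Expand: every particle proposes `branch` candidate next states.
        candidates = [
            toy_denoise_step(p, rng) for p in particles for _ in range(branch)
        ]
        # Score: weight each candidate by exp(lam * verifier score).
        weights = [math.exp(lam * toy_verifier(c)) for c in candidates]
        # Resample back down to n_particles (multinomial, with replacement).
        particles = rng.choices(candidates, weights=weights, k=n_particles)
    return particles


if __name__ == "__main__":
    for particle in s3_like_search():
        print(particle, toy_verifier(particle))
```

The sketch omits any explicit diversity-preservation step; multinomial resampling can collapse onto a few candidates, which is the failure mode a diversity-preserving resampler is meant to avoid.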

If this is right

S³ improves performance on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA while leaving the base model and decoding schedule unchanged.
The largest gains appear on mathematical reasoning tasks.
The method approximates a reward-tilted distribution that favors higher-quality outputs while staying anchored to the model prior.
Classical search over denoising trajectories supplies a practical test-time scaling mechanism for diffusion language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar mid-process search to other generative models could shift test-time scaling away from final-output selection alone.
Improving the accuracy of reference-free verifiers would likely amplify the observed gains on step-by-step reasoning tasks.
The approach suggests that allocating compute at intermediate generation steps is more effective than increasing the number of final samples for the same total budget.
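To make the "same total budget" comparison concrete, the sketch below counts network function evaluations (NFE) using the formula given in the Figure 3 caption, NFE = steps × N × b. The best-of-K cost model (one forward pass per sample per step) and the step count of 64 are assumptions chosen for illustration, not values from the paper.

```python
def s3_nfe(steps, n_particles, branch):
    """NFE for the search: steps x N x b, as stated in the Figure 3 caption."""
    return steps * n_particles * branch


def best_of_k_nfe(steps, k):
    """Assumed cost model: each of the K samples costs one pass per step."""
    return steps * k


if __name__ == "__main__":
    steps = 64  # hypothetical number of denoising steps
    print(s3_nfe(steps, n_particles=4, branch=2))  # 512
    print(best_of_k_nfe(steps, k=8))               # 512
```

Under these assumptions the settings shown in Figure 3, N=4 with b=2 and BoK with K=8, spend the same number of forward passes, so any accuracy difference reflects where the compute is spent rather than how much.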

Load-bearing premise

A lightweight reference-free verifier can reliably identify promising intermediate trajectories during denoising without ground-truth answers or access to final output quality.

What would settle it

Running S³ and standard best-of-K sampling on LLaDA-8B-Instruct on the MATH-500 benchmark and finding that S³ does no better, or does worse, than best-of-K would falsify the claim that the stratified search improves outputs.

Figures

Figures reproduced from arXiv: 2604.06260 by Ahsan Bilal, Asad Aali, Dean F. Hougen, Emily Fox, Muhammad Ahmed Mohsin, Muhammad Umer, Muhammad Usman Khanzada, Muhammad Usman Rafique, Zihao He.

**Figure 2.** $S^3$: $N$ particles are initialized at $t=T$ [P1], then at each step expanded into $N \cdot b$ candidates, scored via one-step clean predictions $\hat{x}_0^{(i,j,t)}$ [P2], and resampled to $N$ particles via SSP [P3]. Final output is selected by majority voting at $t=0$.

… distribution in Eq. (1). We approximate this target in practice using $S^3$, an inference-time search procedure over denoising trajectories that requires no retrain…

**Figure 3.** Inference-time scaling with $S^3$ across datasets. Top row: accuracy vs. compute (NFE = steps × N × b) across multiple (N, b) settings. Bottom row: mean top-1 token confidence over denoising progress. Curves shown for Baseline, $S^3$ (N=4, b=2, λ=1.0), and BoK (K=8).

3.4 Verifier, compute budget, and final output selection. We use a lightweight ground-truth-free composite verifier scoring candidate outputs via …

**Figure 4.** Accuracy (%) across block lengths K ∈ {2, 4, 8, 16, 32, 64} on four benchmarks. Results use LLaDA-8B-Instruct with N=4, b=2, and K=8. Orange bars denote $S^3$, blue bars denote Baseline, and green bars denote BoK. An upward arrow (↑) indicates configurations where the leading method improves over the Baseline.

… baseline is standard single-trajectory diffusion decoding from $p_0$; full experimental details are p…

**Figure 5.** Distributional shift of verifier scores across denoising steps under $S^3$ search on MATH-500. Sequential resampling progressively concentrates particles in higher-reward regions compared to direct sampling from $p_0$. The KDE plot (Figure 5a) shows the score distribution shifting rightward across steps, while the heatmap (Figure 5b) shows the particle score density concentrating toward higher-r…
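The Figure 2 caption notes that the final output is selected by majority voting at t=0. A minimal sketch of that selection step is given below, assuming the final answer has already been extracted from each surviving particle as a string; the extraction step and the tie-breaking rule are assumptions, since the captions do not specify them.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical final answers from N=4 particles.
    print(majority_vote(["42", "41", "42", "7"]))  # 42
```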

Original abstract

Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-$K$ sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. We propose $S^3$ (Stratified Scaling Search), a classical verifier-guided search method that improves generation by reallocating compute during the denoising process rather than only at the final output stage. At each denoising step, $S^3$ expands multiple candidate trajectories, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier. This procedure effectively approximates a reward-tilted sampling distribution that favors higher-quality outputs while remaining anchored to the model prior. Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA demonstrate that $S^3$ consistently improves performance across benchmarks, achieving the largest gains on mathematical reasoning tasks while leaving the underlying model and decoding schedule unchanged. These results show that classical search over denoising trajectories provides a practical mechanism for test-time scaling in DLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

S³ moves the search inside the denoising steps of a diffusion LM using a reference-free verifier to resample trajectories, but the abstract supplies no numbers, ablations, or verifier details to show the gains are real.

The one or two things to know about this paper are that it introduces a search method operating inside the denoising steps of diffusion language models, using a verifier to guide which trajectories to continue, and that the empirical claims rest on very little shown evidence. The new part is the stratified scaling search that expands candidates at each step, scores them reference-free, and resamples while preserving diversity. This differs from standard best-of-K by intervening during the process rather than after.

The paper does well by demonstrating the idea on an existing model like LLaDA-8B-Instruct across several benchmarks without any training changes, and by focusing on math tasks where such guidance might matter most. The soft spots are the lack of any quantitative results, ablations, or verifier details in the abstract. We don't see effect sizes, whether the gains are statistically significant, or how the verifier was built and whether its scores correlate with final accuracy. If the verifier is not predictive, this method doesn't add much beyond extra sampling. The assumption that a lightweight verifier can judge partial denoising trajectories without ground truth is central and untested here.

This work is aimed at people exploring inference-time methods for diffusion LMs or test-time scaling in general. A reader in that area might get value from the concrete procedure once the full paper fills in the gaps. It deserves a serious referee because the core idea is implementable and addresses a real question in the field. I would recommend sending it to peer review, expecting the authors to add the necessary ablations and analysis of the verifier.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes $S^3$ (Stratified Scaling Search), a verifier-guided search procedure for test-time scaling in diffusion language models. At each denoising step the method expands multiple trajectories, scores them with a lightweight reference-free verifier, and resamples promising candidates while preserving diversity, thereby reallocating compute to approximate a reward-tilted distribution without altering the base DLM or decoding schedule. Experiments with LLaDA-8B-Instruct report consistent gains on MATH-500, GSM8K, ARC-Challenge and TruthfulQA, largest on mathematical reasoning tasks.

Significance. If the empirical claims hold, the work supplies a practical, model-agnostic mechanism for inference-time improvement in DLMs by classical search over intermediate denoising trajectories rather than final outputs only. This could be relevant for scaling compute in generative language models where high-probability regions of the diffusion prior are misaligned with task quality, especially on reasoning benchmarks.

major comments (2)

[Experiments] Experiments section: the abstract and reported results claim consistent improvements (largest on MATH-500) but supply no quantitative effect sizes, error bars, ablation tables against random selection or best-of-K, or statistical significance tests. This is load-bearing for the central claim that the verifier-guided procedure outperforms naive sampling, because without these controls it is impossible to determine whether gains survive increased diversity alone.
[Method] Method (§3): the description of the lightweight reference-free verifier provides no architecture, training data, scoring function, or correlation statistics between partial-trajectory scores and final correctness on MATH-500. Because the advantage over best-of-K rests entirely on the verifier reliably identifying promising trajectories without ground-truth access, the absence of these details leaves the key mechanism unvalidated.

minor comments (2)

[Method] Notation for the search frontier and resampling step is introduced without a compact pseudocode or equation block, making the precise reallocation rule difficult to reproduce from the text alone.
[Related Work] The paper does not cite prior verifier-guided search work in non-diffusion settings (e.g., process reward models or tree search in LLMs), which would help situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We value the feedback and will revise the manuscript to address the concerns raised regarding the experiments and method details. Our point-by-point responses are as follows.

Referee: [Experiments] Experiments section: the abstract and reported results claim consistent improvements (largest on MATH-500) but supply no quantitative effect sizes, error bars, ablation tables against random selection or best-of-K, or statistical significance tests. This is load-bearing for the central claim that the verifier-guided procedure outperforms naive sampling, because without these controls it is impossible to determine whether gains survive increased diversity alone.

Authors: We agree with the referee that the current experiments section lacks the necessary quantitative details to fully support the claims. Specifically, we will add effect sizes (e.g., absolute and relative improvements), error bars from multiple independent runs, ablation tables comparing S³ against random selection of trajectories and standard best-of-K sampling, and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) to demonstrate that the gains are not due to increased diversity alone. These additions will be included in the revised version of the paper.

revision: yes

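As an illustration of the kind of paired significance test the response refers to, the sketch below runs a two-sided sign-flip permutation test on per-problem correctness. This is a swapped-in technique chosen because it needs only the standard library; the response itself names paired t-tests and Wilcoxon tests. The function name and the example data are hypothetical.

```python
import random


def paired_permutation_test(correct_a, correct_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired 0/1 correctness."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same problems")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical per-problem results for two decoding methods.
    method_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    method_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
    print(paired_permutation_test(method_a, method_b))
```

Because the test is paired, problems that both methods solve or both miss contribute nothing; only the problems where the two methods disagree drive the p-value.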
Referee: [Method] Method (§3): the description of the lightweight reference-free verifier provides no architecture, training data, scoring function, or correlation statistics between partial-trajectory scores and final correctness on MATH-500. Because the advantage over best-of-K rests entirely on the verifier reliably identifying promising trajectories without ground-truth access, the absence of these details leaves the key mechanism unvalidated.

Authors: We thank the referee for pointing out the insufficient details on the verifier. In the revised manuscript, we will provide a complete description of the lightweight reference-free verifier, including its architecture (a compact model trained to score denoising trajectories), the training data and procedure, the precise scoring function used at each step, and empirical correlation statistics between the partial-trajectory scores and the final answer correctness on MATH-500. This will substantiate the verifier's ability to identify promising paths without access to ground truth.

revision: yes
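The correlation statistic requested here could be computed along the following lines. This is a minimal sketch assuming one verifier score per trajectory and a 0/1 final-correctness label; Pearson correlation between a continuous score and a binary label is the point-biserial correlation. The data shown are hypothetical, and the choice of statistic is an assumption rather than something the response specifies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; with binary ys this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical partial-trajectory verifier scores and final correctness.
    scores = [0.15, 0.40, 0.35, 0.80, 0.70, 0.20]
    correct = [0, 0, 1, 1, 1, 0]
    print(pearson(scores, correct))
```

Reporting this value at several denoising steps would show how early in the trajectory the verifier's score becomes informative about the final answer.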

Circularity Check

0 steps flagged

No significant circularity: empirical search procedure with independent experimental validation

The paper presents S³ as a classical verifier-guided search heuristic for reallocating compute during denoising in diffusion language models. No derivation chain, equations, or first-principles claims are advanced that reduce the method's outputs or performance gains to fitted parameters, self-defined quantities, or self-citations by construction. Performance improvements are asserted via direct experiments on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA using an unmodified base model, without any renaming of known results, uniqueness theorems imported from prior author work, or ansatzes smuggled through citations. The central mechanism (trajectory expansion, reference-free scoring, and resampling) is described as an approximation to reward-tilted sampling but is not mathematically equated to its inputs; any gains are treated as empirical outcomes rather than tautological predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that intermediate denoising states contain sufficient signal for a lightweight verifier to steer search productively. No free parameters are explicitly named in the abstract, and no new physical or mathematical entities are introduced.

axioms (1)

domain assumption: Intermediate states in the diffusion denoising process carry enough information for a reference-free verifier to rank trajectory quality.
This premise is required for the per-step evaluation and resampling step to improve final outputs rather than add noise.

pith-pipeline@v0.9.0 · 5554 in / 1434 out tokens · 45108 ms · 2026-05-10T20:19:46.378938+00:00 · methodology
