pith. sign in

arxiv: 2604.25419 · v1 · submitted 2026-04-28 · 💻 cs.AI

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords label-free RLVRmathematical reasoningLean verifiermajority votingResZero fallbackLLM reinforcement learninggeneralizationverifiable rewards
0
0 comments X

The pith

JURY-RL lets model rollouts vote on answers then rewards only those matching a formally verified plurality, reaching supervised-level math performance without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JURY-RL as a way to run reinforcement learning with verifiable rewards on reasoning tasks while eliminating the need for human-annotated correct answers. Model-generated rollouts first vote to propose a candidate answer; a formal verifier such as Lean then decides whether rollouts that produced that exact answer receive positive reward. When the verifier cannot confirm the voted answer, a fallback called ResZero assigns a neutral, zero-mean signal across the remaining rollouts instead of reinforcing the unverified majority. This separation prevents false-positive rewards from destabilizing training, which the authors show yields higher benchmark scores than other label-free methods and matches the accuracy of training on ground-truth labels.

Core claim

JURY-RL decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, ResZero (Residual-Zero) discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus.

What carries the argument

Decoupling of proposal via plurality votes from disposal via Lean verification, with ResZero as the neutral fallback signal when verification fails.

If this is right

  • JURY-RL outperforms other label-free baselines on mathematical reasoning benchmarks across three backbone models.
  • It reaches pass@1 accuracy comparable to supervised training that uses ground-truth labels.
  • It produces higher pass@k scores and greater response diversity, indicating stronger generalization.
  • Performance transfers competitively to code generation tasks and general language benchmarks.
  • The optimization gradient remains stable even on problems where formal verification returns inconclusive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same vote-then-verify pattern could be applied to any domain equipped with an automated checker, such as type-checked code or formally specified protocols.
  • Neutral redistribution of reward mass may help preserve exploration in other reinforcement-learning settings where consensus signals are noisy.
  • If the verifier is only sound but incomplete, models may still converge on correct but unprovable answers; testing against stronger provers would quantify this gap.
  • Combining JURY-RL with a small amount of verified seed data might further improve early training stability without restoring full annotation cost.

Load-bearing premise

The plurality-voted answer supplies an accurate positive reward once Lean verifies it, and the ResZero fallback redistributes a neutral signal without destabilizing gradients when verification remains inconclusive.

What would settle it

Training runs on the same math benchmarks where the ResZero component is removed or replaced by a constant zero reward; if final pass@1 and pass@k scores fall below the full JURY-RL results or below supervised baselines, the necessity of the neutral redistribution is refuted.

Figures

Figures reproduced from arXiv: 2604.25419 by Biao Fu, Dayiheng Liu, Guoxin Chen, Jing Wu, Minpeng Liao, Xinggao Liu, Xinjie Chen.

Figure 1
Figure 1. Figure 1: The JURY-RL workflow: Votes Propose, Proofs Dispose. For each problem, a majority vote from multiple rollouts (Jury) proposes a candidate answer. A Lean verifier (Judge) then disposes the reward. If the answer is Verified (δ = 1), supporting rollouts receive a positive reward, directly linking the update to successful formal verification. Conversely, when verification is Inconclusive (δ = 0), all rollouts … view at source ↗
Figure 2
Figure 2. Figure 2: Average unique an￾swers per sample over training steps on Qwen3-1.7B-Base. Consistently, view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy on MATH5000 Validation set over training steps. 5.3. Analysis Ablation Studies of ResZero. We compare four fallback designs under the same proof-gated framework when δ=0: Zero Reward (no signal), Random Reward (rewarding a ran￾domly selected candidate), MV Reward, and our proposed ResZero, with GT and Majority-Voting as references. As shown in view at source ↗
Figure 4
Figure 4. Figure 4: The pipeline of our Lean Verifier: a collaborative system that combines autoformalization, formal verification, and consistency checking to validate mathematical reasoning with high reliability. • Autoformalizer: Utilizes the Goedel-Formalizer-V2-32B model (Lin et al., 2025) to translate informal mathematical statements into formal Lean specifications. • Consistency-Checker: Implemented using QwQ-32B (Qwen… view at source ↗
Figure 5
Figure 5. Figure 5: Training dynamics under different values of the hyperparameter c. G.2. Pass@k Results Main Results across Benchmarks. This appendix complements The Diversity Analysis of JURY-RL in Section 5.2. The complete numerical results are listed in view at source ↗
Figure 6
Figure 6. Figure 6: Test-Time Training Dynamics on Qwen2.5-7B. We plot exponentially moving-averaged (EMA) validation accuracy for Jury-RL (purple) and majority-voting TTRL (orange) over 400 test-time training steps. Jury-RL exhibits more stable and continued improvement on challenging benchmarks. 27 view at source ↗
Figure 7
Figure 7. Figure 7: Test-Time Training Dynamics on Qwen2.5-14B. We plot exponentially moving-averaged (EMA) validation accuracy for JURY-RL (purple) and majority-voting TTRL (orange) over 150 test-time training steps. JURY-RL exhibits more stable and continued improvement on challenging benchmarks view at source ↗
Figure 8
Figure 8. Figure 8: Training dynamics of precision, recall, validation accuracy, and training entropy under different Lean pass@k verification settings. \[ f\left(\frac{18}{7}, \frac{18}{7}\right) = ,→ \left(\frac{18}{7}\right)\left(\frac{18}{7}\right)(72 - ,→ 3\left(\frac{18}{7}\right) - 4\left(\frac{18}{7}\right)) \] \[ = \left(\frac{18}{7}\right)ˆ2 (72 - 7\left(\frac{18}{7}\right)) \] \[ = \left(\frac{18}{7}\right)ˆ2 (72 -… view at source ↗
Figure 9
Figure 9. Figure 9: Accuracy on MATH5000 Validation set over training steps. have h1 : (1152 : R) ∈ {z : R | ∃ x y : R, x > 0 ∧ y > 0 ∧ 3*x + 4*y < 72 ∧ z = x * ,→ y * (72 - 3*x - 4*y)} := by refine’ ⟨8, 6, by norm_num, by norm_num, by norm_num, _⟩ <;> norm_num <;> ring_nf <;> norm_num have h2 : ∀ (z : R), z ∈ {z : R | ∃ x y : R, x > 0 ∧ y > 0 ∧ 3*x + 4*y < 72 ∧ z = x ,→ * y * (72 - 3*x - 4*y)} → z ≤ (1152 : R) := by intro z … view at source ↗
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces JURY-RL, a label-free RLVR framework for LLMs that decouples answer proposal (via plurality voting over model rollouts) from reward disposal (via formal Lean verification of the voted answer). Positive rewards are assigned only to matching rollouts when verification succeeds; otherwise ResZero discards the proposal and redistributes a claimed zero-mean, variance-preserving signal over residuals to preserve gradient stability. Across three backbone models trained on mathematical data, the method is claimed to outperform other label-free baselines on math reasoning benchmarks, achieve pass@1 comparable to supervised ground-truth training, and exhibit superior generalization via higher pass@k and response diversity, with competitive transfer to code generation and general benchmarks.

Significance. If the empirical claims and the ResZero modeling assumption hold, the work offers a practical route to scalable RLVR in machine-checkable domains without human labels or curated rewards, potentially lowering annotation costs while preserving or improving performance and generalization. The proposal-verification decoupling is conceptually clean and could generalize to other verifiable settings.

major comments (3)
  1. [Method (ResZero)] Method section (ResZero description): The claim that ResZero 'discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers' is load-bearing for the stability and unbiased-gradient assertions, yet the manuscript supplies neither a derivation of the zero-mean property nor an empirical check (e.g., mean/variance statistics or reward histograms on failed-verification rollouts). If residual answers are asymmetrically distributed around the unknown ground truth, the fallback can inject policy-dependent bias.
  2. [Experiments] Experiments section (quantitative claims): The headline results—'consistent outperformance,' 'pass@1 performance comparable to supervised ground-truth training,' and 'superior generalization demonstrated by higher pass@k and response diversity'—are stated without visible tables, error bars, ablation studies, or statistical significance tests in the provided abstract and summary. Full evaluation tables (including per-model, per-benchmark numbers and baseline comparisons) are required to substantiate the central empirical claim.
  3. [Experiments (transfer)] §4 (transfer experiments): The transfer results to code generation and general benchmarks are described as 'competitive' but lack detail on whether the same ResZero mechanism was used or whether verification was available in those domains; this affects the scope of the label-free claim.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'maintains a stable optimization gradient without reinforcing unverifiable consensus' is repeated from the method description; a brief parenthetical reference to the ResZero construction would improve readability.
  2. [Method] Notation: The paper should explicitly define the reward function r(·) and the ResZero redistribution operator in a single equation block for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Method (ResZero)] Method section (ResZero description): The claim that ResZero 'discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers' is load-bearing for the stability and unbiased-gradient assertions, yet the manuscript supplies neither a derivation of the zero-mean property nor an empirical check (e.g., mean/variance statistics or reward histograms on failed-verification rollouts). If residual answers are asymmetrically distributed around the unknown ground truth, the fallback can inject policy-dependent bias.

    Authors: We acknowledge that the zero-mean and variance-preserving properties of ResZero require explicit support. In the revised manuscript we will add a short derivation in the Method section showing that the redistributed signal is constructed to have zero mean (by uniform allocation over residuals independent of the policy) and preserved variance (by scaling to match the original signal's second moment). We will also include empirical checks consisting of mean/variance statistics and reward histograms computed on failed-verification rollouts across training runs. On the bias concern, the redistribution is symmetric and external to the policy; verification failure is determined by the Lean oracle rather than model outputs. We will add a brief limitations paragraph noting the assumption and its scope. revision: yes

  2. Referee: [Experiments] Experiments section (quantitative claims): The headline results—'consistent outperformance,' 'pass@1 performance comparable to supervised ground-truth training,' and 'superior generalization demonstrated by higher pass@k and response diversity'—are stated without visible tables, error bars, ablation studies, or statistical significance tests in the provided abstract and summary. Full evaluation tables (including per-model, per-benchmark numbers and baseline comparisons) are required to substantiate the central empirical claim.

    Authors: The full manuscript already contains per-model and per-benchmark tables with baseline comparisons in Section 4. To address visibility and rigor, we will (i) move the primary results table into the main body, (ii) add error bars from multiple random seeds, (iii) include ablations on the voting threshold and ResZero components, and (iv) report statistical significance via paired tests on the key metrics. These updates will be incorporated in the revised version. revision: yes

  3. Referee: [Experiments (transfer)] §4 (transfer experiments): The transfer results to code generation and general benchmarks are described as 'competitive' but lack detail on whether the same ResZero mechanism was used or whether verification was available in those domains; this affects the scope of the label-free claim.

    Authors: We will expand §4 to clarify the transfer protocol. For code generation we replace Lean verification with unit-test oracles while retaining the identical JURY-RL proposal-verification decoupling and ResZero fallback. For general benchmarks lacking any verifier we apply the voting proposal with a neutral (zero) reward signal, preserving the label-free training objective. The revised text will explicitly state these adaptations and discuss their implications for the label-free claim. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper proposes a methodological framework (JURY-RL with ResZero fallback) rather than a mathematical derivation chain containing equations, predictions, or first-principles results. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The zero-mean property of ResZero is presented as an explicit design choice for the reward signal, not derived from or equivalent to the inputs by construction. Performance claims are framed as empirical outcomes from training experiments on external benchmarks, not reductions to the method's own definitions. The framework depends on an independent external verifier (Lean), keeping the central argument self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, background axioms, or new postulated entities beyond the named components of the framework itself.

pith-pipeline@v0.9.0 · 5543 in / 1159 out tokens · 77869 ms · 2026-05-07T16:15:11.394985+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages · 1 internal anchor

  1. [15]

    Dickerson

    doi: 10.48550/ARXIV .2508.00410. URL https: //doi.org/10.48550/arXiv.2508.00410. Zhao, A., Wu, Y ., Yue, Y ., Wu, T., Xu, Q., Lin, M., Wang, S., Wu, Q., Zheng, Z., and Huang, G. Abso- lute zero: Reinforced self-play reasoning with zero data. CoRR, abs/2505.03335, 2025a. doi: 10.48550/ARXIV . 2505.03335. URLhttps://doi.org/10.48550/ arXiv.2505.03335. Zhao,...

  2. [16]

    The analysis should be based on the actual content of the text

    Math Assertion Analysis Identify all structurally and semantically relevant components of the mathematical problem, including variables, types, quantifiers, constraints, logic structure, conclusion, and so on. The analysis should be based on the actual content of the text

  3. [17]

    The analysis should reflect the actual content present in the Lean code

    Lean Statement Analysis (ignore proof part) Extract all structurally and semantically relevant components from the Lean statement, including variables, types, conditions, quantifiers, constraints, the final claim, and so on. The analysis should reflect the actual content present in the Lean code

  4. [18]

    - Preservation of constraints and boundary assumptions

    Comparative Verification Check for exact correspondence between the math and Lean statements; you may refer to aspects like: - Semantic alignment, logic structure, and quantifier correctness. - Preservation of constraints and boundary assumptions. - Accurate typing and use of variables. - Syntactic validity and proper Lean usage (free from errors). - Use ...

  5. [19]

    Final Judgement Based solely on the above analysis, judge whether the Lean statement is a correct and exact formalization of the mathematical problem

  6. [20]

    Proofs Dispose

    Accuracy Confirmation If correct: clearly confirm why all elements match. If incorrect: list all mismatches and explain how each one affects correctness. Note: While the analysis may be broad and open to interpreting all relevant features, the final judgment must be based only on what is explicitly and formally expressed in the Lean statement. **Do not co...