arxiv: 2602.03452 · v2 · submitted 2026-02-03 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

Yujuan Pang , Jiaxin Li , Xin Sheng , Ran Peng , Yong Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords RLVRprompt selectionpositive-negative pairingGRPOrare-event amplificationmath reasoningsample efficiencybidirectional signals

0 comments

The pith

Positive-negative prompt pairing with rare-event reweighting in GRPO improves sample efficiency in RLVR for math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that variance-based prompt selection in RLVR creates unstable optimization because it lacks explicit negative signals from rare failures. Instead, it proposes selecting one hard-but-solvable prompt and one easy-but-brittle prompt at each update based on their empirical success rates across rollouts. Weighted GRPO then reweights binary outcomes at the pair level and applies group-normalized advantages to turn infrequent successes into strong positive guidance and infrequent failures into strong negative penalties. This bidirectional mechanism delivers stable learning signals while preserving exploration, leading to higher benchmark pass rates with only one paired minibatch per update. A reader would care because the method matches or exceeds performance of large-scale training that draws from over a thousand prompts.

Core claim

We propose positive-negative pairing: at each update, we sample a hard-but-solvable q+ and an easy-but-brittle prompt q- (high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on q+ into sharp positive guidance while turning rare failures on q- into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration.

What carries the argument

Positive-negative pairing of prompts selected by low and high empirical success rates under rollouts, combined with Weighted GRPO that applies pair-level reweighting and group-normalized advantages to amplify rare events.

If this is right

A single paired minibatch per update raises AIME 2025 Pass@8 from 16.8 to 22.2 and AMC23 Pass@64 from 94.0 to 97.0 on Qwen2.5-Math-7B compared with variance-based GRPO.
Performance remains competitive with RLVR trained from a pool of 1209 prompts.
Comparable gains appear on the Qwen2.5-Math-7B-Instruct model.
The bidirectional signals improve sample efficiency while avoiding suppression of exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairing principle could be tested on non-math RLVR tasks such as code generation by identifying brittle prompts in those domains.
If the method reduces reliance on large prompt pools, it may enable RLVR training under tighter compute or data budgets.
Tracking success-rate variance inside each pair might offer an additional signal for deciding when to refresh the pair.

Load-bearing premise

Characterizing prompts by low and high empirical success rates under multiple rollouts reliably identifies hard-but-solvable and easy-but-brittle prompts that together supply stable bidirectional learning signals without suppressing exploration.

What would settle it

Running the paired method and the variance-based GRPO baseline on identical Qwen2.5-Math-7B models with the same number of rollouts and updates, then observing no improvement or a drop in AIME 2025 Pass@8 or AMC23 Pass@64, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2602.03452 by Jiaxin Li, Ran Peng, Xin Sheng, Yong Ma, Yujuan Pang.

**Figure 1.** Figure 1: Overview of bidirectional prompt selection and WGRPO. A common low-data RLVR baseline is to prioritize “high-variance” prompts, which can be sensitive to sampling noise. We instead select a two-prompt positive–negative pair: a hard prompt where rare successes provide strong positive guidance, and an easy prompt where rare failures provide strong negative penalties. WGRPO contrastively amplifies these tail … view at source ↗

**Figure 2.** Figure 2: Pass@k curves on AIME 2025, AMC23, and MATH500 for Qwen2.5-Math-7B with Base Model, GRPO+DSR-sub, GRPO+{π1, π2}, WGRPO+{π1209, p12}. Except training on the large dataset (GRPO+DSR-sub), WGRPO+{π1209, p12} overall outperforms other methods across different k. On AMC23 at k = 32, 64 and MATH500 at k = 8, 16, 32, 64, WGRPO+{π1209, p12} shows the strongest performance. 1 2 4 8 16 32 64 k (number of samples) 10… view at source ↗

**Figure 3.** Figure 3: Pass@k curves on AIME 2025, AMC23, and MATH500 for Qwen2.5-Math-7B-Instruct with Base Model, GRPO+DSR-sub, GRPO+{π1, π2}, WGRPO+{π1209, p12}. WGRPO+{π1209, p12} is comparable to other methods, but shows distinct gains on AIME 2025. Comparison to large-scale RLVR: competitive performance with 2 vs. 1209 training prompts. For Qwen2.5- Math-7B in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \emph{positive--negative pairing}: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$(high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The pairing and reweighting idea is concrete and shows reported gains, but the evidence for the mechanism working as described is still thin without verification that hard prompts are solvable.

read the letter

This paper's core idea is to move past variance-based prompt selection in RLVR by explicitly pairing a hard-but-solvable prompt with an easy-but-brittle one and reweighting their outcomes in GRPO to amplify rare successes and failures. The new part is the bidirectional pairing plus the pair-level reweighting in Weighted GRPO. It gives better sample efficiency, with reported gains on Qwen2.5-Math-7B where one pair per update beats a variance baseline on AIME 2025 and AMC23, and stays competitive with much larger prompt pools. They do a good job laying out the principle: reliable positive anchor from the hard prompt and negative signals from the brittle one. The empirical results on two models are concrete. But the soft spots are real. The stress test concern about misclassifying unsolvable prompts as q+ is worth checking. If the low success rate comes from finite rollouts on a truly impossible prompt, the positive term vanishes. The paper should verify that selected q+ actually get successes in training and report how the success rates correlate with solvability. Also, the abstract lacks error bars, ablations on the reweighting, or details on rollout numbers for pairing, which makes it hard to replicate or trust the source of the gains. For readers working on efficient training of reasoning LLMs, this could be useful if the mechanism checks out. It deserves a serious referee because the idea is novel enough and the benchmarks show promise, even if more rigorous validation is needed.

Referee Report

2 major / 2 minor

Summary. The paper claims that in RLVR for LLMs on deterministic reasoning tasks, variance-based prompt selection leads to unstable optimization, and proposes instead a positive-negative pairing mechanism: at each update, select a hard-but-solvable q+ (low empirical success rate under multiple rollouts) and an easy-but-brittle q- (high but imperfect success rate), then apply Weighted GRPO that reweights binary outcomes at the pair level with group-normalized advantages to amplify rare successes on q+ and rare failures on q-. This is said to supply stable bidirectional signals, improving sample efficiency. On Qwen2.5-Math-7B, a single paired minibatch per update yields AIME 2025 Pass@8 of 22.2 (vs. 16.8 for variance-based GRPO) and AMC23 Pass@64 of 97.0 (vs. 94.0), while remaining competitive with RLVR using a 1209-prompt pool; similar gains are reported for the Instruct variant.

Significance. If the empirical gains are attributable to the bidirectional mechanism rather than confounding factors, the work provides a concrete, mechanism-level improvement to prompt selection in RLVR that reduces reliance on large prompt pools while preserving exploration. The pair-level reweighting and group normalization in Weighted GRPO constitute a clear algorithmic contribution that could generalize to other outcome-based RL settings for reasoning models.

major comments (2)

[Positive-negative pairing and Weighted GRPO description] The core assumption that low empirical success rate reliably identifies a solvable q+ (so that rare-success amplification can occur) is load-bearing for the bidirectional-signal claim, yet the manuscript provides no verification that selected q+ prompts ever succeed during training or that the rollout-based rate correlates with actual solvability. Finite rollouts can misclassify unsolvable prompts as q+, causing the positive term in Weighted GRPO to vanish and reducing the update to negative-only pressure on q-.
[Experiments and results] The reported benchmark gains (AIME 2025 Pass@8: 16.8 → 22.2; AMC23 Pass@64: 94.0 → 97.0) are presented without error bars, ablations on the number of rollouts used for success-rate characterization, or statistical tests comparing the pairing rule against the variance-based baseline. This leaves open whether improvements stem from the claimed mechanism or from unstated factors such as the specific reweighting factors or prompt pool composition.

minor comments (2)

[Abstract] The abstract states that similar gains occur on Qwen2.5-Math-7B-Instruct but supplies no numerical results or comparison details for that model.
[Weighted GRPO] Notation for the pair-level reweighting factors and the precise definition of group-normalized advantages should be introduced with explicit equations rather than prose description only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below with clarifications and commit to revisions that directly strengthen the empirical support for our claims.

read point-by-point responses

Referee: [Positive-negative pairing and Weighted GRPO description] The core assumption that low empirical success rate reliably identifies a solvable q+ (so that rare-success amplification can occur) is load-bearing for the bidirectional-signal claim, yet the manuscript provides no verification that selected q+ prompts ever succeed during training or that the rollout-based rate correlates with actual solvability. Finite rollouts can misclassify unsolvable prompts as q+, causing the positive term in Weighted GRPO to vanish and reducing the update to negative-only pressure on q-.

Authors: We agree that explicit verification is needed. All prompts are drawn from AIME/AMC problems, which are solvable by construction. Selection uses 8–16 rollouts per prompt to estimate success rates, retaining only those with low but strictly positive rates. To directly address potential misclassification, the revised manuscript will add a new analysis subsection that tracks, for every selected q+ during training, whether at least one success occurs in the subsequent training rollouts. We will report the fraction of q+ prompts that produce positive signals and the average number of successes per q+ minibatch, confirming that the positive term in Weighted GRPO remains active throughout training. revision: yes
Referee: [Experiments and results] The reported benchmark gains (AIME 2025 Pass@8: 16.8 → 22.2; AMC23 Pass@64: 94.0 → 97.0) are presented without error bars, ablations on the number of rollouts used for success-rate characterization, or statistical tests comparing the pairing rule against the variance-based baseline. This leaves open whether improvements stem from the claimed mechanism or from unstated factors such as the specific reweighting factors or prompt pool composition.

Authors: We accept that the current results lack statistical rigor. In the revision we will: (i) report means and standard deviations over at least three independent training runs with different random seeds; (ii) add an ablation table varying the number of rollouts (4, 8, 16) used for success-rate estimation and show its effect on final AIME/AMC scores; (iii) include paired t-tests with p-values comparing the positive-negative pairing method against the variance-based GRPO baseline; and (iv) explicitly state that the prompt pool is identical to the one used in the variance baseline and prior large-scale RLVR work, thereby ruling out pool-composition confounds. These additions will be placed in a new “Additional Experiments” subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method and gains are independent of fitted inputs

full rationale

The paper defines positive-negative pairing via a heuristic that selects q+ and q- from empirical success rates over multiple rollouts, then applies a proposed Weighted GRPO reweighting with group-normalized advantages. These are presented as a new selection rule and loss modification whose benefits are validated directly through benchmark experiments (AIME 2025, AMC23) against a variance-based GRPO baseline. No equations or derivations reduce the reported performance deltas to quantities defined by the same fitted parameters or self-citations; the central claims remain self-contained empirical outcomes rather than tautological restatements of the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that empirical success rates from multiple rollouts can be used to select informative prompt pairs and that group-normalized advantages will amplify rare events into useful gradients.

free parameters (1)

pair-level reweighting factors
Used to amplify rare successes on q+ and rare failures on q- in Weighted GRPO; exact values or selection procedure not stated in abstract.

axioms (1)

domain assumption Prompts can be reliably classified into hard-but-solvable and easy-but-brittle categories using empirical success rates from multiple rollouts
Invoked to define q+ and q- for the positive-negative pairing strategy.

pith-pipeline@v0.9.0 · 5615 in / 1314 out tokens · 31427 ms · 2026-05-16T08:21:05.531934+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 12 internal anchors

[1]

Asymmetric REINFORCE for off-policy reinforcement learning: Balancing positive and negative rewards

Arnal, C., Narozniak, G., Cabannes, V ., Tang, Y ., Kempe, J., and Munos, R. Asymmetric REINFORCE for off-policy reinforcement learning: Balancing positive and negative rewards. arXiv preprint arXiv:2506.20520,

work page arXiv
[2]

Aime problems and solutions

Art of Problem Solving. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME Problems and Solutions, 2025a. Accessed: 2025-04-20. Art of Problem Solving. AMC Problems and Solutions. https://artofproblemsolving.com/wiki/index.php?title= AMC Problems and Solutions, 2025b. Accessed: 2025- 04-20. Cai, X.-Q., Wang, W., Liu, F., Liu,...

work page arXiv 2025
[3]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Petroski Such, F., Cummings, D., Plappert, M., C...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Stepwise guided policy optimization: Coloring your incorrect rea- soning in GRPO

Chen, P., Li, X., Li, Z., Chen, X., and Lin, T. Stepwise guided policy optimization: Coloring your incorrect rea- soning in GRPO. arXiv preprint arXiv:2505.11595,

work page arXiv
[5]

Das, N., Chakraborty, S., Pacchiano, A., and Chowdhury, S. R. Active preference optimization for sample-efficient rlhf.arXiv preprint arXiv:2402.10500,

work page arXiv
[6]

Don’t waste mistakes: Leveraging negative RL- groups via confidence reweighting

Feng, Y ., Jain, P., Hartshorn, A., Duan, Y ., and Kempe, J. Don’t waste mistakes: Leveraging negative RL- groups via confidence reweighting. arXiv preprint arXiv:2510.08696,

work page arXiv
[7]

On designing effective rl reward at training time for llm reasoning

Gao, J., Xu, S., Ye, W., Liu, W., He, C., Fu, W., Mei, Z., Wang, G., and Wu, Y . On designing effective rl reward at training time for llm reasoning. arXiv preprint arXiv:2410.15115,

work page arXiv
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

W., and Dasigi, P

Ivison, H., Zhang, M., Brahman, F., Koh, P. W., and Dasigi, P. Large-scale data selection for instruction tuning.arXiv preprint arXiv:2503.01807,

work page arXiv
[10]

OpenAI o1 System Card

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Car- ney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Kazemnejad, A., Aghajohari, M., Portelance, E., Sordoni, A., Reddy, S., Courville, A., and Roux, N. L. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. arXiv preprint arXiv:2410.01679,

work page arXiv
[12]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team, Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Lambert, N., Morrison, J., Pyatkin, V ., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V ., Liu, A., Dziri, N., Lyu, S., et al. T ¨ULU 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

P., Li, L., and Li, Y

Li, J., Zhou, P., Meng, R., Vadera, M. P., Li, L., and Li, Y . Turn-ppo: Turn-level advantage estimation with ppo for improved multi-turn rl in agentic llms. arXiv preprint arXiv:2512.17008, 2025a. 9 Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing Li, W. and Li, Y . Process reward model with q-value rank- ings...

work page arXiv
[15]

Limr: Less is more for rl scaling

Li, X., Zou, H., and Liu, P. Limr: Less is more for rl scaling. arXiv preprint arXiv:2502.11886, 2025b. Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. arXiv preprint arXiv:2305.20050,

work page arXiv
[16]

Enabling weak llms to judge response reliability via meta ranking.arXiv preprint arXiv:2402.12146,

Liu, Z., Kou, B., Li, P., Yan, M., Zhang, J., Huang, F., and Liu, Y . Enabling weak llms to judge response reliability via meta ranking.arXiv preprint arXiv:2402.12146,

work page arXiv
[17]

Understanding R1-Zero-Like Training: A Critical Perspective

Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

M., Straube, J., Basra, M., Pazdera, A., Thaman, K., Ferrante, M

Prime Intellect Team, Jaghouar, S., Mattern, J., Ong, J. M., Straube, J., Basra, M., Pazdera, A., Thaman, K., Ferrante, M. D., Gabriel, F., Obeid, F., Erdem, K., Keiblinger, M., and Hagemann, J. Intellect-2: A reasoning model trained through globally decentralized reinforcement learning. arXiv preprint arXiv:2505.07291,

work page arXiv
[19]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseek- math: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y ., Lin, H., and Wu, C. Hybridflow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Wang, Y ., Yang, Q., Zeng, Z., Ren, L., Liu, L., Peng, B., Cheng, H., He, X., Wang, K., Gao, J., Chen, W., Wang, S., Du, S. S., and Shen, Y . Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Wei, Z., Yang, X., Sun, K., Wang, J., Shao, R., Chen, S., Kachuee, M., Gollapudi, T., Liao, T., Scheffer, N., Wanga, R., Kumar, A., Meng, Y ., tau Yih, W., and Dong, X. L. Truthrl: Incentivizing truthful llms via reinforce- ment learning. arXiv preprint arXiv:2509.25760,

work page arXiv
[24]

Scalable chain of thoughts via elastic reasoning

Xu, Y ., Dong, H., Wang, L., Sahoo, D., Li, J., and Xiong, C. Scalable chain of thoughts via elastic reasoning. arXiv preprint arXiv:2505.05315,

work page arXiv
[25]

Unearthing gems from stones: Policy optimization with negative sample augmentation for LLM reasoning

Yang, Z., Ye, Y ., Jiang, S., Hu, C., Li, L., Deng, S., and Jiang, D. Unearthing gems from stones: Policy optimization with negative sample augmentation for LLM reasoning. arXiv preprint arXiv:2505.14403,

work page arXiv
[26]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Fan, T., Liu, G., Liu, L., Liu, X., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yuan, Y ., Yu, Q., Zuo, X., Zhu, R., Xu, W., Chen, J., Wang, C., Fan, T., Du, Z., Wei, X., et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025a. Yuan, Y ., Yue, Y ., Zhu, R., Fan, T., and Yan, L. What’s behind ppo’s collapse in long-cot? value optimization holds the secret. arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Derivation of closed-form rare-event amplification Proposition A.1.Let k be the number of correct responses in a group of size G, and let p=k/G be the group success rate

10 Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing A. Derivation of closed-form rare-event amplification Proposition A.1.Let k be the number of correct responses in a group of size G, and let p=k/G be the group success rate. For0< k < G, suppose the outcome mapping is yi = ( +1,if responseo i is correct, −λneg...

work page 2025
[29]

A key observation is that, onceλ neg is moderately large (e.g.,≥50in our sweep), results vary only slightly across a broad range of values

Max context length 4096 Learning rate1×10 −6 Group size (Gresponses per prompt) 8 Max training steps 500 Hardware budget≤8 H100 GPUs Optimizer AdamW (verl default) Adamβ 1, β2 0.9, 0.95 Weight decay 0.01 LR scheduler / warmup none Gradient clipping 1 Max generation tokens 3072 Reward format binary verifiable reward (exact match) KL coefficient 0.001 Negat...

work page 2025
[30]

All other pairs are selected by our bidirectional prompt selection method (Section 3.2) and trained with WGRPO

and trained with GRPO. All other pairs are selected by our bidirectional prompt selection method (Section 3.2) and trained with WGRPO. Bold and underlined numbers denote the best and second-best results for eachk(within each dataset). Prompt set Pass@k k1 2 4 8 16 32 64 AIME 2025 {π1, π2}(baseline) 3.6 6.5 11.0 16.8 23.8 32.2 41.3 {π1, π3}(baseline) 4.1 7...

work page 2025
[31]

They are denoted as $f’’(x)$, $f’’’(x)$, $fˆ{(4)}(x)$, ..., $fˆ{(n)}(x)$

Prompt: Define the derivative of the $(n-1)$th derivative as the $n$th derivative $(n \\in Nˆ{ *}, n \\geqslant 2)$, that is, $fˆ{(n)}(x)=[fˆ{(n-1)}(x)]’$. They are denoted as $f’’(x)$, $f’’’(x)$, $fˆ{(4)}(x)$, ..., $fˆ{(n)}(x)$. If $f(x) = xeˆ{x}$, then the $2023$rd derivative of the function $f(x)$ at the point $(0, fˆ{(2023)}(0))$ has a $y$-intercept o...

work page 2023
[32]

Exactly one of those regions has finite area

Prompt: The set of points in 3-dimensional coordinate space that lie in the plane $x+y+z=75$ whose coordinates satisfy the inequalities $x-yz<y-zx<z-xy$ forms three disjoint convex regions. Exactly one of those regions has finite area. The area of this finite region can be expressed in the form $a\\sqrt{b}$, where $a$ and $b$ are positive integers and $b$...

work page 2025
[33]

Points $C$ and $D$ lie on $\\omega_2$ such that $\\overline{BC}$ is a diameter of $\\omega_2$ and $\\overline{BC} \\perp \\overline{AD}$. The rectangle $EFGH$ is inscribed in $\\omega_1$ such that $\\overline{EF} \\perp \\overline{BC}$, $C$ is closer to $\\overline{GH}$ than to $\\overline{EF}$, and $D$ is closer to $\\overline{FG}$ than to $\\overline{EH...

work page 2025
[34]

Let’s think step by step and output the final answer within \\boxed{}

17 Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing Prompt: Let $ A_1A_2 \\ldots A_{11} $ be an 11-sided non-convex simple polygon with the following properties:\n * The area of $ A_iA_1A_{i+1} $ is 1 for each $ 2 \\leq i \\leq 10 $,\n * $ \\cos(\\angle A_iA_1A_{i+1}) = \\frac{12}{13} $ for each $ 2 \\leq i \\l...

work page 2025
[35]

Find the sum of these three values of $ k $

Prompt: There are exactly three positive real numbers $ k $ such that the function\n $ f(x) = \\frac{(x - 18)(x - 72)(x - 98)(x - k)}{x} $\ndefined over the positive real numbers achieves its minimum value at exactly two positive real numbers $ x $. Find the sum of these three values of $ k $. Let’s think step by step and output the final answer within \\...

work page 2025