A key observation is that, once λ_neg is moderately large (e.g., ≥ 50 in our sweep), results vary only slightly across a broad range of values

Max context length: 4096
Learning rate: 1×10⁻⁶
Group size (G responses per prompt): 8
Max training steps: 500
Hardware budget: ≤ 8 H100 GPUs
Optimizer: AdamW (verl default)
Adam β1, β2 0 · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

cs.LG · 2026-02-03 · unverdicted · novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

citing papers explorer

Showing 1 of 1 citing paper.

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
cs.LG · 2026-02-03 · unverdicted · none · ref 29
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
