OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
arXiv preprint arXiv:2507.08649 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CL 3roles
background 1polarities
background 1representative citing papers
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
-
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.