Reinforcement learning on pre-training data. arXiv preprint arXiv:2509.19249
6 Pith papers cite this work.
representative citing papers
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
- Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
  Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
- LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
  KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.
- RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
  Experiments indicate RL applied early in pre-training often matches full SFT-then-RL performance, targeted data composition outweighs scale for RL success, and averaging RL and SFT objectives outperforms sequential or single methods.
- From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
  PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
- ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
  ARES generates 100K rubric-annotated QA instances from raw documents and demonstrates superior rubric-based RL performance over baselines on open-ended benchmarks.