Canonical reference

Title resolution pending

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author= · 2024

Canonical reference. 100% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 100% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.

Deep Variational Inference Symbolic Regression

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DVISR performs variational inference over symbolic expression trees and constants by training a neural network with the ELBO as reward, recovering true posteriors in simple test cases.

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

cs.CL · 2024-10-10 · conditional · novelty 7.0

Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

GoLongRL releases a 23K-sample open long-context RL dataset spanning 9 tasks and introduces TMN-Reweight to improve multitask optimization, achieving performance comparable to much larger models under GRPO.

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.

GAGPO: Generalized Advantage Grouped Policy Optimization

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

GAGPO computes step-aligned temporal advantages from grouped rollout samples without a learned critic, enabling stable policy optimization in multi-turn agent environments.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.

Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

cs.GT · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.

Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

VIGOR assigns higher rewards to LLM completions that produce smaller l2 norms of teacher-forced negative log-likelihood gradients, with sqrt(T) length correction and group ranking, yielding +3.31% math and +1.91% code gains over RLIF on Qwen2.5-7B.

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · conditional · novelty 6.0 · 2 refs

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.

Rotation-Preserving Supervised Fine-Tuning

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

Mitigating Cognitive Bias in RLHF by Altering Rationality

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.

Milestone-Guided Policy Learning for Long-Horizon Language Agents

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.

Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.

citing papers explorer

Showing 36 of 36 citing papers.

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions cs.CL · 2026-05-13 · unverdicted · none · ref 12
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration cs.AI · 2026-05-07 · unverdicted · none · ref 7
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
Deep Variational Inference Symbolic Regression cs.LG · 2026-05-01 · unverdicted · none · ref 20
DVISR performs variational inference over symbolic expression trees and constants by training a neural network with the ELBO as reward, recovering true posteriors in simple test cases.
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity cs.LG · 2026-05-01 · unverdicted · none · ref 2
UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 14
ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 192
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models cs.CL · 2024-10-10 · conditional · none · ref 31
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR cs.LG · 2026-05-20 · unverdicted · none · ref 2
Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment cs.CL · 2026-05-19 · unverdicted · none · ref 31
GoLongRL releases a 23K-sample open long-context RL dataset spanning 9 tasks and introduces TMN-Reweight to improve multitask optimization, achieving performance comparable to much larger models under GRPO.
AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code cs.CL · 2026-05-18 · unverdicted · none · ref 20
AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero cs.LG · 2026-05-14 · unverdicted · none · ref 7
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
GAGPO: Generalized Advantage Grouped Policy Optimization cs.CL · 2026-05-13 · unverdicted · none · ref 6
GAGPO computes step-aligned temporal advantages from grouped rollout samples without a learned critic, enabling stable policy optimization in multi-turn agent environments.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 13 · 2 links
ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning cs.GT · 2026-05-11 · unverdicted · none · ref 48 · 2 links
Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward cs.LG · 2026-05-11 · unverdicted · none · ref 10
VIGOR assigns higher rewards to LLM completions that produce smaller l2 norms of teacher-forced negative log-likelihood gradients, with sqrt(T) length correction and group ranking, yielding +3.31% math and +1.91% code gains over RLIF on Qwen2.5-7B.
Crosslingual On-Policy Self-Distillation for Multilingual Reasoning cs.CL · 2026-05-10 · unverdicted · none · ref 32
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 56
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL · 2026-05-09 · conditional · none · ref 3 · 2 links
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
Rotation-Preserving Supervised Fine-Tuning cs.LG · 2026-05-08 · unverdicted · none · ref 57
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning cs.AI · 2026-05-08 · unverdicted · none · ref 35
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
Mitigating Cognitive Bias in RLHF by Altering Rationality cs.AI · 2026-05-07 · unverdicted · none · ref 33
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
Milestone-Guided Policy Learning for Long-Horizon Language Agents cs.CL · 2026-05-07 · unverdicted · none · ref 13
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling cs.AI · 2026-05-04 · unverdicted · none · ref 13 · 2 links
APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.
Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought cs.LG · 2026-04-20 · unverdicted · none · ref 6
CAL-GRPO calibrates per-attempt weights in multi-attempt CoT to deliver unbiased gradients for optimizing Verification@K success while keeping variance low.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 34
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
torchtune: PyTorch native post-training library cs.LG · 2026-05-20 · unverdicted · none · ref 27
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) cs.LG · 2026-05-13 · unverdicted · none · ref 32
RL post-training lifts answer correctness on FHIR-AgentBench from 50% (o4-mini) to 77% with a cheaper Qwen3-8B CodeAct agent.
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 9
DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains up to 36.3pp on math, code, constraint, and schema benchmarks for diffusion LLMs.
GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models cs.CL · 2026-05-02 · unverdicted · none · ref 22
GIFT guides adapter fine-tuning on base models with confidence signals from instruction-tuned models before merging, yielding task-specialized models that outperform direct fine-tuning on math and knowledge benchmarks.
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges cs.CL · 2026-05-19 · unverdicted · none · ref 89
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning cs.LG · 2026-05-18 · unreviewed · ref 25
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning cs.AI · 2026-05-13 · unreviewed · ref 57
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning cs.LG · 2026-05-12 · unreviewed · ref 11
Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization cs.CL · 2026-05-12 · unreviewed · ref 37
Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents cs.CL · 2026-04-20 · unreviewed · ref 9
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models cs.AI · 2026-04-19 · unreviewed · ref 45

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer