
American Invitational Mathematics Examination (AIME) 2024

Yifan Zhang, Team Math-AI · 2024

10 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

dataset 4

citation-polarity summary

use dataset 4

representative citing papers

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.

Harnessing LLM Agents with Skill Programs

cs.AI · 2026-05-18 · conditional · novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.

Reasoning Compression with Mixed-Policy Distillation

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

citing papers explorer


Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG · 2026-05-16 · unverdicted · none · ref 41

Learning from Language Feedback via Variational Policy Distillation
cs.LG · 2026-05-14 · unverdicted · none · ref 46

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
cs.LG · 2026-05-20 · unverdicted · none · ref 81

Harnessing LLM Agents with Skill Programs
cs.AI · 2026-05-18 · conditional · none · ref 29

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
cs.CL · 2026-05-17 · unverdicted · none · ref 44

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL · 2026-05-14 · unverdicted · none · ref 28

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
cs.LG · 2026-05-12 · unverdicted · none · ref 33

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
cs.LG · 2026-05-07 · unverdicted · none · ref 34 · 2 links

Reasoning Compression with Mixed-Policy Distillation
cs.AI · 2026-05-09 · unverdicted · none · ref 31

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI · 2026-05-08 · unverdicted · none · ref 63
