Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al · 2024

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

cs.LG · 2026-05-13 · 2 refs

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

cs.CL · 2026-02-17

citing papers explorer

Showing 8 of 8 citing papers.

DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 40 · 2 links
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Selective Off-Policy Reference Tuning with Plan Guidance cs.AI · 2026-05-12 · unverdicted · none · ref 36 · 2 links
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 16
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR cs.LG · 2026-05-07 · unverdicted · none · ref 50
S-trace adds sparse eligibility traces to RLVR that mask low-entropy tokens, outperforming GRPO by 0.49-3.16% pass@16 on Qwen3 models while improving sample and token efficiency.
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation cs.AI · 2026-05-06 · unverdicted · none · ref 34
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 38
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective cs.LG · 2026-05-13 · unreviewed · ref 36 · 2 links
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens cs.CL · 2026-02-17 · unreviewed · ref 35

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions.Hugging Face repository, 13(9):9

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer