Measuring Mathematical Problem Solving With the MATH Dataset

Akul Arora, Collin Burns, Dan Hendrycks, Eric Tang, Saurav Kadavath, Steven Basart · 2021 · cs.LG · arXiv 2103.03874

Baseline reference. 54% of classified citations from Pith papers use this work as a benchmark or comparison.

402 Pith papers cite it.

abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

citation-role summary

dataset: 47 · background: 32 · method: 1

citation-polarity summary

use dataset: 43 · background: 32 · unclear: 4 · use method: 1
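
As a quick sanity check on the two tallies above, the following Python sketch converts the counts to percentage shares. The dictionary names and the `shares` helper are illustrative and not part of any Pith tooling. Both tallies sum to 80 classified citations, and 43 "use dataset" citations out of 80 round to 54%, which is consistent with the headline baseline figure. Whether the page actually derives that figure this way is an assumption.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"dataset": 47, "background": 32, "method": 1}
polarity_counts = {"use dataset": 43, "background": 32, "unclear": 4, "use method": 1}


def shares(counts):
    """Return each label's share of the total, as a rounded percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Both summaries should describe the same set of classified citations.
assert sum(role_counts.values()) == sum(polarity_counts.values())

print(shares(role_counts))
print(shares(polarity_counts))
```

Only 80 of the 402 citing papers appear in these tallies, so the shares describe the classified subset rather than all citing papers.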

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

cs.AI · 2021-08-31 · accept · novelty 8.0

MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.

Less is MoE: Trimming Experts in Domain-Specialist Language Models

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.

Conformal Language Modeling via Posterior Sampling

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Conformal language modeling samples from posterior approximations conditioned on high-scoring regions to achieve risk control with higher utility than post-hoc filtering in open-ended text generation.

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

cs.AI · 2026-06-01 · conditional · novelty 7.0

2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

D³: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

citing papers explorer

Showing 50 of 402 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 24 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete cs.LG · 2026-06-01 · unverdicted · none · ref 57 · internal anchor
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions cs.CL · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data q-fin.CP · 2026-04-03 · conditional · none · ref 11 · internal anchor
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models cs.AI · 2026-04-02 · unverdicted · none · ref 8 · internal anchor
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology cs.AI · 2026-03-30 · conditional · none · ref 6 · internal anchor
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 119 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Training Software Engineering Agents and Verifiers with SWE-Gym cs.SE · 2024-12-30 · conditional · none · ref 3 · internal anchor
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 22 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark cs.CL · 2024-06-27 · unverdicted · none · ref 20 · internal anchor
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics cs.AI · 2021-08-31 · accept · none · ref 5 · internal anchor
MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.
When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon cs.LG · 2026-06-29 · unverdicted · none · ref 24 · internal anchor
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories cs.AI · 2026-06-26 · unverdicted · none · ref 27 · internal anchor
DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.
Tandem Reinforcement Learning with Verifiable Rewards cs.AI · 2026-06-26 · unverdicted · none · ref 7 · internal anchor
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR cs.AI · 2026-06-23 · unverdicted · none · ref 19 · internal anchor
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios cs.LG · 2026-06-04 · unverdicted · none · ref 8 · internal anchor
Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.
Less is MoE: Trimming Experts in Domain-Specialist Language Models cs.LG · 2026-06-04 · unverdicted · none · ref 13 · internal anchor
Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations cs.LG · 2026-06-03 · unverdicted · none · ref 50 · internal anchor
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
Conformal Language Modeling via Posterior Sampling cs.LG · 2026-06-02 · unverdicted · none · ref 15 · internal anchor
Conformal language modeling samples from posterior approximations conditioned on high-scoring regions to achieve risk control with higher utility than post-hoc filtering in open-ended text generation.
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks cs.CR · 2026-06-02 · unverdicted · none · ref 20 · internal anchor
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery cs.AI · 2026-06-01 · conditional · none · ref 16 · internal anchor
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 13 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
D³: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training cs.CL · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 29 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Compositional Generalization in Autoregressive Models via Logit Composition cs.LG · 2026-05-27 · unverdicted · none · ref 10 · internal anchor
Logit composition of autoregressive models is projective under factorized conditionals, preserved under smooth reparameterizations, and maintains length generalization when assumptions hold uniformly.
RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data cs.LG · 2026-05-26 · unverdicted · none · ref 17 · internal anchor
ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling cs.LG · 2026-05-25 · unverdicted · none · ref 9 · internal anchor
ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning cs.LG · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation cs.LG · 2026-05-20 · conditional · none · ref 15 · internal anchor
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting cs.LG · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.
Learning How to Cube cs.LG · 2026-05-15 · unverdicted · none · ref 31 · internal anchor
A neuro-symbolic post-training pipeline lets a 4B transformer learn cubing heuristics that reach pass@5 of 53 on 100 SAT competition instances, matching the strongest symbolic baseline.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 9 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights cs.CL · 2026-05-13 · unverdicted · none · ref 49 · internal anchor
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 7 · 2 links · internal anchor
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models cs.LG · 2026-05-10 · unverdicted · none · ref 36 · internal anchor
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets cs.CR · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 9 · internal anchor
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 17 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents cs.AI · 2026-05-08 · unverdicted · none · ref 5 · 2 links · internal anchor
AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp performance drops with increasing depth.
KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting cs.CL · 2026-05-08 · unverdicted · none · ref 43 · 2 links · internal anchor
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation cs.LG · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
PropGuard is a propagation-aware framework for LLM-MAS that constructs dual-view spatio-temporal graphs, employs a GE-GRPO inspector to recover suspicious subgraphs, and applies source-guided remediation to lower attack success while preserving task performance.
