hub Mixed citations

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen · 2024 · DOI 10.18653/v1/2024.acl-long.211

Mixed citation behavior. Most common role is background (38%).

42 Pith papers citing it

Background 38% of classified citations

citation-role summary

dataset 4 · background 2 · method 2

citation-polarity summary

background 3 · use dataset 3 · unclear 1 · use method 1

representative citing papers

DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

DART is a training-free router that accepts direct answers on draft agreement and allocates thinking budgets via draft entropy on disagreement, reporting accuracy gains and token reductions on math and code benchmarks across model scales.

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty

cs.CL · 2026-04-19 · unverdicted · novelty 7.0

Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.

Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

Evo-L2S uses multi-objective evolutionary model merging to produce reasoning models that cut generated chain-of-thought length by over 50% while preserving or improving accuracy on math benchmarks.

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

cs.LG · 2025-07-02 · unverdicted · novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.

Consolidating Rewarded Perturbations for LLM Post-Training

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

Holder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

EvoTD applies crossover for skill composition and parametric mutation for complexity scaling, filtered by a Zone of Proximal Development, to generate tasks that improve LLM reasoning generalization across models.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

ExpThink reduces average CoT response length by up to 77% while improving accuracy on math benchmarks via experience-guided reward shaping and difficulty-adaptive advantage in RL.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

cs.CL · 2025-12-08 · unverdicted · novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

cs.LG · 2025-12-05 · unverdicted · novelty 6.0

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.

citing papers explorer

Showing 42 of 42 citing papers.

DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models cs.AI · 2026-06-22 · unverdicted · none · ref 16
DART is a training-free router that accepts direct answers on draft agreement and allocates thinking budgets via draft entropy on disagreement, reporting accuracy gains and token reductions on math and code benchmarks across model scales.
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions cs.AI · 2026-06-02 · unverdicted · none · ref 17
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 12
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving cs.CL · 2026-04-23 · unverdicted · none · ref 126
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty cs.CL · 2026-04-19 · unverdicted · none · ref 4
Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
Multi-objective Evolutionary Merging Enables Efficient Reasoning Models cs.CL · 2026-04-07 · unverdicted · none · ref 3
Evo-L2S uses multi-objective evolutionary model merging to produce reasoning models that cut generated chain-of-thought length by over 50% while preserving or improving accuracy on math benchmarks.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling cs.LG · 2025-07-02 · unverdicted · none · ref 12
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning cs.CL · 2026-06-11 · unverdicted · none · ref 18
RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.
Consolidating Rewarded Perturbations for LLM Post-Training cs.CL · 2026-05-29 · unverdicted · none · ref 20
CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR cs.LG · 2026-05-19 · unverdicted · none · ref 9
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models cs.CL · 2026-05-17 · unverdicted · none · ref 47
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play cs.AI · 2026-05-16 · unverdicted · none · ref 25
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero cs.LG · 2026-05-14 · unverdicted · none · ref 25
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning cs.AI · 2026-05-13 · unverdicted · none · ref 38
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
Holder Policy Optimisation cs.LG · 2026-05-12 · unverdicted · none · ref 11 · 2 links
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling cs.LG · 2026-05-12 · unverdicted · none · ref 1
EvoTD applies crossover for skill composition and parametric mutation for complexity scaling, filtered by a Zone of Proximal Development, to generate tasks that improve LLM reasoning generalization across models.
Selective Off-Policy Reference Tuning with Plan Guidance cs.AI · 2026-05-12 · unverdicted · none · ref 42 · 2 links
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression cs.LG · 2026-05-08 · unverdicted · none · ref 14 · 2 links
ExpThink reduces average CoT response length by up to 77% while improving accuracy on math benchmarks via experience-guided reward shaping and difficulty-adaptive advantage in RL.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning cs.LG · 2026-04-25 · unverdicted · none · ref 22
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards cs.CL · 2026-04-20 · unverdicted · none · ref 44
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping cs.LG · 2026-04-13 · unverdicted · none · ref 48
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning cs.CL · 2025-12-08 · unverdicted · none · ref 13
NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning cs.LG · 2025-12-05 · unverdicted · none · ref 6
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR cs.CL · 2025-07-21 · unverdicted · none · ref 11
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning cs.LG · 2025-05-21 · unverdicted · none · ref 30
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 139
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
CausalMix: Data Mixture as Causal Inference for Language Model Training cs.LG · 2026-07-01 · unverdicted · none · ref 48
CausalMix fits a causal model on 512 runs of a 0.5B model to estimate CATE, then extrapolates optimal mixtures for an 800K data pool applied to 7B and 4B models, outperforming RegMix.
EntroRouter: Learning Efficient Model Routing via Entropy Regulation cs.CL · 2026-06-28 · unverdicted · none · ref 73
EntroRouter applies entropy regulation in a single-round routing framework to decouple reasoning from routing, retaining 98.3% of top expert accuracy at 48.25% lower compute cost.
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff cs.LG · 2026-06-07 · unverdicted · none · ref 74
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning cs.CL · 2026-06-05 · unverdicted · none · ref 34
ThinkBooster supplies a modular library, joint performance-efficiency benchmark, and deployable proxy for test-time compute scaling of LLM reasoning on math and coding tasks.
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning cs.CL · 2026-05-25 · unverdicted · none · ref 6
DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.
Language models fail at extended rule following cs.CL · 2026-05-03 · unverdicted · none · ref 44
LLMs fail at extended counting of repeated characters due to finite internal states, with abrupt errors persisting across model scales and inference methods.
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs cs.CL · 2026-05-01 · unverdicted · none · ref 21 · 2 links
MathArena is broadened into a maintained platform with new benchmarks for proofs, research questions, and formal verification, where GPT-5.5 scores 98% on 2026 USAMO and 74% on research-level tasks.
Efficient Test-Time Scaling via Temporal Reasoning Aggregation cs.AI · 2026-04-19 · unverdicted · none · ref 56
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration cs.LG · 2025-08-19 · unverdicted · none · ref 5
DARS adaptively increases rollouts on hard problems in RLVR to improve Pass@K, and when paired with batch scaling for breadth, achieves gains in both Pass@K and Pass@1 by treating depth and breadth as complementary exploration dimensions.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 243
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale cs.CL · 2026-06-13 · unverdicted · none · ref 167
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy cs.LG · 2026-06-07 · unverdicted · none · ref 2
Introduces Discrepancy-Constrained MDP (DCMDP) with Lagrangian relaxation to optimize LLM RL under train-inference discrepancy constraints, claiming performance gains on 8B and 30B models.
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges cs.CL · 2026-05-19 · unverdicted · none · ref 34
A literature survey synthesizing benchmarks, architectures, training strategies, and evaluation methods for mathematical reasoning in LLMs, based on roughly 120 papers.
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving cs.CL · 2026-04-22 · unreviewed · ref 110
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning cs.AI · 2026-04-20 · unreviewed · ref 45
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning cs.AI · 2026-04-04 · unreviewed · ref 21
