2512.06201

· 2025 · arXiv 2512.06201

5 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 1 · unclear 1

representative citing papers

GQA-μP: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via a promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

cs.DC · 2026-04-29 · unverdicted · novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput, converging 3.9-8% faster on average than fixing one while tuning the other.

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unverdicted · novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

citing papers explorer

Showing 5 of 5 citing papers.

GQA-μP: The maximal parameterization update for grouped query attention
cs.LG · 2026-05-14 · unverdicted · none · ref 11
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL · 2026-05-12 · unverdicted · none · ref 80
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI · 2026-05-21 · unverdicted · none · ref 50
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
cs.DC · 2026-04-29 · unverdicted · none · ref 15
Reinforcement Learning from Human Feedback
cs.LG · 2025-04-16 · unverdicted · none · ref 186
