LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
Multiplex thinking: Reasoning via token-wise branch- and-merge.arXiv preprint arXiv:2601.08808
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7verdicts
UNVERDICTED 7roles
background 1polarities
background 1representative citing papers
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
b1 is a plug-and-play post-training framework that trains diffusion LLMs to produce dynamic-size reasoning blocks by optimizing a monotonic entropy descent objective via reinforcement learning.
Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.
TARPO is a pure RL framework using a token-wise action router to switch between discrete token generation and latent reasoning in LLMs, with joint optimization showing outperformance on benchmarks.
Select to Think reframes LLM help as ranking among SLM top-K candidates and distills the ranking ability back into the SLM for improved single-pass reasoning.
citing papers explorer
No citing papers match the current filters.