Multiplex thinking: Reasoning via token-wise branch-and-merge. arXiv preprint arXiv:2601.08808
6 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (6)
verdicts: UNVERDICTED (6)
roles: background (1)
polarities: background (1)
citing papers
- Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing
  LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
- CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning
  CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
- Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
  b1 is a plug-and-play post-training framework that trains diffusion LLMs to produce dynamic-size reasoning blocks by optimizing a monotonic entropy descent objective via reinforcement learning.
- Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning
  Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter reasoning chains.
- Select to Think: Unlocking SLM Potential with Local Sufficiency
  Select to Think reframes LLM help as ranking among SLM top-K candidates and distills the ranking ability back into the SLM for improved single-pass reasoning.