arxiv: 2512.07843 · v2 · pith:VVP5JLKRnew · submitted 2025-11-24 · 💻 cs.LG · cs.AI · cs.CL

ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

classification 💻 cs.LG cs.AI cs.CL
keywords reasoning, parallel, models, sequential, accuracy, inference, latency, performance

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but their inherently sequential decoding incurs substantial latency, motivating parallelization of the generation process. However, existing parallel reasoning approaches suffer from performance degradation compared to their sequential counterparts, and often rely on specialized inference engines. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that matches the accuracy of comparably sized sequential reasoning models while significantly reducing inference latency via three key innovations: 1) a two-stage parallel trajectory generator that produces high-quality parallel chain-of-thought data for supervised fine-tuning; 2) a trie-based rollout design that enables parallel reasoning on any off-the-shelf autoregressive inference engine; and 3) a parallelization-aware reinforcement learning framework that trains the model to balance reasoning accuracy with effective parallelization. Across six challenging math reasoning benchmarks, ThreadWeaver trained on top of Qwen3-8B achieves performance on par with cutting-edge sequential reasoning models (79.9% on AIME24 and 71.9% on average) while delivering up to 1.53x speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.

  2. Regulating Branch Parallelism in LLM Serving

    cs.DC 2026-05 unverdicted novelty 7.0

    TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...

  3. Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers

    cs.LG 2026-06 unverdicted novelty 6.0

    LOTUS uses a looped padded Transformer with parallel cross-entropy supervision on gold CoT tokens to match explicit CoT performance at 3B parameters while reducing thought-phase latency 2.5x-6.9x.