
arxiv: 2601.09093 · v2 · submitted 2026-01-14 · 💻 cs.LG

Recognition: unknown

Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling

classification 💻 cs.LG
keywords: pruning, reasoning, step, trace, end-to-end, latency, traces, evaluation

Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
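The two components named in the abstract, a step scorer over hidden states and memory-triggered pruning, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scorer architecture, how step scores are aggregated, or how memory saturation is detected. The linear probe in `score_step`, the mean aggregation in `Trace.quality`, and the token-count budget `kv_budget_tokens` are all assumptions made for illustration, and every name below is hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: int
    kv_tokens: int = 0  # tokens this trace currently holds in the KV cache
    step_scores: list = field(default_factory=list)

    @property
    def quality(self) -> float:
        # Assumption: trace quality is the mean of its step scores so far.
        if not self.step_scores:
            return 0.5
        return sum(self.step_scores) / len(self.step_scores)


def score_step(hidden_state, weights, bias):
    """Hypothetical lightweight scorer: a logistic probe on one hidden-state vector."""
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def prune_if_saturated(traces, kv_budget_tokens):
    """Drop the lowest-quality traces until total KV usage fits the budget.

    Assumption: saturation is modelled as a simple token budget, and at
    least one trace is always kept alive.
    """
    alive = sorted(traces, key=lambda t: t.quality, reverse=True)
    pruned = []
    while len(alive) > 1 and sum(t.kv_tokens for t in alive) > kv_budget_tokens:
        pruned.append(alive.pop())  # the last element has the lowest quality
    return alive, pruned


# Toy usage: three traces, each holding 400 tokens, against a budget of 1000.
weights, bias = [1.0, -1.0], 0.0
traces = [Trace(0, kv_tokens=400), Trace(1, kv_tokens=400), Trace(2, kv_tokens=400)]
hidden = {0: [2.0, 0.0], 1: [0.0, 2.0], 2: [0.5, 0.0]}  # made-up step hidden states
for t in traces:
    t.step_scores.append(score_step(hidden[t.trace_id], weights, bias))

alive, pruned = prune_if_saturated(traces, kv_budget_tokens=1000)
print([t.trace_id for t in alive], [t.trace_id for t in pruned])
```

In this toy run the total of 1200 tokens exceeds the budget, so the trace with the lowest score (trace 1) is dropped and the other two continue. Pruning only when memory is saturated, rather than on a fixed schedule, keeps as many traces alive as the hardware allows; the released repository should be consulted for how the actual system makes these choices.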



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  3. Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

    cs.LG 2026-04 unverdicted novelty 7.0

    Applying STP at consecutive semantic reasoning steps achieves 168x more accurate multi-step latent prediction on ProcessBench than frozen baselines, with trajectories forming smooth curves best captured by non-linear ...

  4. Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.