pith. sign in

arxiv: 2605.29782 · v1 · pith:D3ZMZJXCnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI· cs.CL

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Pith reviewed 2026-06-29 08:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords state value estimationLLM reinforcement learningPPOcritic collapseNumcaHistaSVEBhidden states
0
0 comments X

The pith

Numca and Hista improve state value estimates in LLM reinforcement learning by countering critic collapse to group averages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard methods like PPO for refining large language models via reinforcement learning produce inaccurate state value estimates because their critics collapse to coarse group-average baselines. It introduces the SVEB benchmark to quantify this problem and then presents two fixes. Numca treats numerical spans inside model outputs as gradable milestones for value estimation. Hista instead uses the model's own hidden states to compute a weighted average over disjoint rollouts and their returns. If correct, these changes deliver more accurate estimates, stronger training performance across algorithms and model sizes, and no meaningful extra compute.

Core claim

Critics in standard PPO-style RL for LLMs collapse to coarse group-average baselines. Numca leverages numerical spans as gradable milestones for state value estimation. Hista uses the LLM's hidden states as a representation to form a weighted average over disjoint rollouts and their returns. Both yield more accurate state value estimates and improve training performance across RL algorithms and model sizes without significant computational overhead.

What carries the argument

Numca and Hista: Numca uses numerical spans inside responses as gradable milestones; Hista uses hidden-state representations to weight-average returns from disjoint rollouts.

If this is right

  • More accurate state value estimates are obtained on the SVEB benchmark.
  • Training performance improves across multiple RL algorithms.
  • Gains appear for models of different sizes.
  • No significant computational overhead is added.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hidden states may carry value-relevant information that could be exploited in other LLM post-training settings beyond the tested RL algorithms.
  • The numerical-span idea in Numca could extend to other discrete or graded signals present in model outputs.
  • If the collapse problem is as central as claimed, similar value-estimation fixes might stabilize RL fine-tuning on tasks with sparser rewards.
  • The SVEB benchmark itself could serve as a diagnostic tool for comparing future value-estimation proposals.

Load-bearing premise

The observed collapse of critics to group-average baselines is the main bottleneck in LLM RL state value estimation, and Numca and Hista directly mitigate it without introducing new failure modes.

What would settle it

An experiment applying Numca or Hista to the same RL setups that shows no improvement in SVEB accuracy scores or downstream training performance relative to the original PPO-style baselines.

Figures

Figures reproduced from arXiv: 2605.29782 by Garry Yang, James Cheng, Jiqian Dong, Yizhou Tian, Yongqiang Chen, Zhitang Chen, Zizhe Chen.

Figure 1
Figure 1. Figure 1: a. Credit assignment of Hista. b. Apply Hista to DAPO on Qwen2.5 base models, where each color represents one settings. Applying Hista achieves better convergence speed and reward. core limitation inherited from GRPO: treating entire re￾sponse as single action and apply the same state value to every token. This simplification stems directly from the difficulty of parsing intermediate states. More sophisti￾… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of the difference between state value pre￾dicted by PPO with λ = 1.0 and the group averaged reward. value definition and training objectives in Section 3.2. Both the actor and critic are initialized from Qwen2.5-1.5B. We construct the SVEB-NUMBER dataset by sampling roughly 1,000 math problems and 5,000 intermediate states from DAPO-17K (Yu et al., 2025) (more details in Section 6.1). The SVEB… view at source ↗
Figure 3
Figure 3. Figure 3: Compare reward on train set (left) and validation set (right) of PPO, GRPO, and GRPO + Numca during training: a. Qwen2.5-1.5B-Instruct; b. Qwen2.5-1.5B-Math-Instruct. racy, whereas GRPO and PPO suffer from instability. This is primarily because finer-grained and more accurate credit as￾signment helps reduce the variance in the gradient. Finally, we select the checkpoint with the best validation accuracy an… view at source ↗
Figure 4
Figure 4. Figure 4: Pipeline of Hista. It takes the last layer hidden state and pools them to state representation. Then, MinDist operation is applied to calculate distance between two states, which is weighted averaged to get the state value estimation. Finally, the estimated state value is treated as baseline to calculate advantage. Definition 5.3. The average estimator: Vˆ avg(st) = 1 N PN i=1 ri , which is an approximatio… view at source ↗
Figure 5
Figure 5. Figure 5: Training dynamics of Qwen3-1.7B and 4B base model on DAPO-17K dataset and AIME24 validation set [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribution of the difference between state value pre￾dicted by Hista and the group averaged reward. The x-axis is the difference and y-axis is the number of prediction. D. SVEB on Different Models Previous research has discovered the special shape and prop￾erty of LLM or Transformer model hidden states (Godey et al., 2024; Rudman et al., 2023), which raise the concern about richness or quality of hidden … view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of the difference between state value pre￾dicted by Numca and the group averaged reward. The x-axis is the difference and y-axis is the number of prediction. The [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
read the original abstract

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses LLM's hidden states as representation to weighted average disjoint rollouts and their return. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the State Value Estimation Benchmark (SVEB) to evaluate state value estimation in LLM RL frameworks, demonstrating that critics in standard methods like PPO collapse to coarse group-average baselines. It proposes Numca, which uses numerical spans as gradable milestones, and Hista, which leverages LLM hidden states to weighted average disjoint rollouts and their returns. The paper claims that these methods yield more accurate state value estimates and enhance training performance across RL algorithms and model sizes without significant computational overhead.

Significance. If the experimental results hold, the work addresses an important underexplored challenge in LLM post-training by providing more accurate value estimates, potentially leading to more stable and effective RL fine-tuning. The SVEB benchmark could serve as a useful tool for the community.

major comments (1)
  1. [Abstract] The abstract asserts extensive experimental results showing improved accuracy and performance but provides no data, error bars, baselines, or details on the methods or SVEB, making the central claims impossible to evaluate from the given text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts extensive experimental results showing improved accuracy and performance but provides no data, error bars, baselines, or details on the methods or SVEB, making the central claims impossible to evaluate from the given text.

    Authors: Abstracts are intentionally concise high-level summaries and do not contain numerical results, error bars, or full method details; this is standard practice to respect length constraints. The full manuscript contains the requested information: Section 3 defines SVEB with its construction and metrics; Sections 4 and 5 detail Numca and Hista; Section 6 reports all experiments with tables and figures that include baselines (PPO, GRPO, etc.), error bars across multiple seeds and model sizes, and quantitative accuracy/performance gains. The central claims are therefore fully evaluable from the complete paper. revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract introduces SVEB, describes collapse of critics to group-average baselines, and proposes Numca and Hista at a conceptual level with no equations, derivations, or parameter-fitting steps shown. No self-definitional relations, fitted inputs renamed as predictions, or self-citation chains appear in the supplied text. The full manuscript is referenced but not provided, preventing inspection of any potential load-bearing steps; therefore the derivation chain cannot be walked and no circularity is detectable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable or stated.

pith-pipeline@v0.9.1-grok · 5693 in / 1075 out tokens · 28767 ms · 2026-06-29T08:34:54.852341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Offline Reinforcement Learning with Implicit Q-Learning

    URL https://proceedings.neurips. cc/paper_files/paper/1999/file/ 6449f44a102fde848669bdd9eb6b76fa-Paper. pdf. Kostrikov, I., Nair, A., and Levine, S. Offline reinforcement learning with implicit q-learning, 2021. URL https: //arxiv.org/abs/2110.06169. Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conserva- tive q-learning for offline reinforcement learn...

  2. [2]

    URL https://www.primeintellect.ai/ blog/synthetic-1-release. MiniMax, :, Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., Xiao, C., Du, C., Zhang, C., Qiao, C., Zhang, C., Du, C., Guo, C., Chen, D., Ding, D., Sun, D., Li, D., Jiao, E., Zhou, H., Zhang, H., Ding, H., Sun, H., Feng, H., Cai, H., Zhu, H., Sun, J....

  3. [3]

    Proximal Policy Optimization Algorithms

    URL https://aclanthology.org/2023. emnlp-main.901/. Schaul, T., Horgan, D., Gregor, K., and Silver, D. Uni- versal value function approximators. In Bach, F. and Blei, D. (eds.),Proceedings of the 32nd Interna- tional Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pp. 1312–1320, Lille, France, 07–09 Jul 2015. PMLR. UR...

  4. [4]

    Assumption A.1(Answer Parsing).The final answer for both states is parsed from one action a, which corresponds to one token

    = ηX i=1 ∥xl 1,i −x l 2,i∥2.(2) To analyze the trajectory of these representations within this simplified setting, we introduce two foundational as- sumptions regarding the generation dynamics and latent boundaries beyond the prefix sequence. Assumption A.1(Answer Parsing).The final answer for both states is parsed from one action a, which corresponds to ...

  5. [5]

    Any vector v⊥ orthogonal to x (i.e., xT v⊥ = 0) is an eigenvector with eigenvalueλ ⊥ = 1

  6. [6]

    Thus, the corresponding eigenvalue isλ ∥ = dϵ ∥x∥2 2+dϵ

    The vectorv ∥ =xitself is an eigenvector, yielding: Id − xxT d r(x)2 x=x− ∥x∥2 2 d r(x)2 x = 1− ∥x∥2 2 ∥x∥2 2 +dϵ x= dϵ ∥x∥2 2 +dϵ x. Thus, the corresponding eigenvalue isλ ∥ = dϵ ∥x∥2 2+dϵ. Since ϵ >0 , the eigenvalues lie strictly within the interval (0,1] , and the maximum eigenvalue is exactly 1 (associated with the orthogonal subspace). Therefore, th...

  7. [7]

    Suppose we continue the generation of s1 and s2 until (T−1) -th state located at the token index ℓ

    ,∀l∈[1, L] 19 Hista and Numca: Estimate State Value Effectively for Large Language Model Reinforcement Learning Proof. Suppose we continue the generation of s1 and s2 until (T−1) -th state located at the token index ℓ. The final answer will be output as (T−1) -th ac- tion. Then the enlongated hidden states at 0-th layer are ¯X0 1 = (x 0 1,1,x 0 1,2, ...,x...

  8. [8]

    A General Case In previous section, we only prove the correlation between MinDistanceof hidden states and final reward in a simple case, where

    ,∀l∈[1, L] A.3. A General Case In previous section, we only prove the correlation between MinDistanceof hidden states and final reward in a simple case, where

  9. [9]

    Hidden states ofs 1, s2 areX l 1,X l 2 ∈R η×d

  10. [10]

    The equation i= arg min j∈{1,...,η} ∥xl 1,i −x l 2,j∥2 al- ways holds for∀i∈ {1, . . . , η}

  11. [11]

    To expand Theorem A.11 to general case, we need to re- view these three assumptions

    The final answer of both states follow Assumption A.1 and A.2. To expand Theorem A.11 to general case, we need to re- view these three assumptions. Let’s begin with the Assump- tion A.2. Because we can control the generation length by controlling the sampling of output token, we can always ensure generated sequences from two states have certain length and...

  12. [12]

    Now, the problem becomes, the relation between the final reward of ˆs2 and s2

    = η1X i=1 ∥xl 1,i − ˆxl 2,i∥2 Then, we can apply the Theorem A.11 to Xl 1 and ˆXl 2 to de- rive the correlation. Now, the problem becomes, the relation between the final reward of ˆs2 and s2. Let’s control the gen- erated sequence length of ˆs2 and s2 so that the final answer of ˆs2 is the forward result of (ℓ+η 1 −η 2)-th token and the final answer of s2...

  13. [13]

    , η 2}, the dot product score p1,i is replicated ˆri times in the first block of p2, where ˆri are nonnegative integers satisfyingPη2 i=1 ˆri =η 1

    For each i∈ {1, . . . , η 2}, the dot product score p1,i is replicated ˆri times in the first block of p2, where ˆri are nonnegative integers satisfyingPη2 i=1 ˆri =η 1. (We allowˆri = 0for somei.)

  14. [14]

    , ℓ}, the entry p2,η1+i−η2 in p2 equalsp 1,i +δ i for someδ i ∈R

    For each i∈ {η 2 + 1, . . . , ℓ}, the entry p2,η1+i−η2 in p2 equalsp 1,i +δ i for someδ i ∈R. Define a1 := softmax(p1),a 2 := softmax(p2), and let a′ 2 ∈R ℓ be the vector obtained by aggregating the scores of a2’s first block according to the originatingp1,i and placing the aggregated mass in coordinate i (coordi- natesi > η 2 remain aligned as described)...

  15. [15]

    Without loss of generality, we assume s1’s hidden states Xl 1 ∈R η1,d and s2’s hidden states Xl 2 ∈R η2,d, where η1 > η2

    ,∀l∈[1, L] Proof. Without loss of generality, we assume s1’s hidden states Xl 1 ∈R η1,d and s2’s hidden states Xl 2 ∈R η2,d, where η1 > η2. To align Xl 1 and Xl 2, we construct a ”fake hidden states” forX l 2: ˆXl 2 = (ˆxl 2,1, ˆxl 2,2, ..., ˆxl 2,η1) such that fori-th token in ˆXl 2,it satisfies ˆxl 2,i =argmin xl 2,j ∈Xl 2 ∥xl 1,i −x l 2,j∥2 Suppose we ...

  16. [16]

    Now, we need to build correlation between ˆXl 2 and Xl

    =MD(X l 1, ˆXl 2). Now, we need to build correlation between ˆXl 2 and Xl

  17. [17]

    important

    We first omit the layer superscript l to build a general formula. According to Lemma A.12, the difference between attention score a2 of X2 and the aggregated attention score ˆa′ 2 of ˆX2 is: ∥a2 − ˆa′ 2∥2 2 = ℓX i=1 a2 2,i 1− κi ¯κ 2 ≤2 =B a. According to the setup of ˆX2 and the definition of aggre- gated ˆa′ 2 in Lemma A.12, it is obvious that (ˆa′ 2):η...

  18. [18]

    Substitute the L2 distance with(1−Cosine Similarity)

  19. [19]

    The result is presented in Table 10

    Change theMinDistanceoperation to sum of distance between hidden states in the same position, which isPn i=1 ∥x 1,i −x 2,i ∥. The result is presented in Table 10. Substituting theMinDis- tancecauses significant performance drop, which validates the Theorem 5.2 empirically. 26 Hista and Numca: Estimate State Value Effectively for Large Language Model Reinf...