
arxiv: 2605.21801 · v1 · pith:NGLFOP6Nnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

Pith reviewed 2026-05-22 08:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: semantic entropy, policy optimization, uncertainty estimation, large language models, post-training, gradient variance, calibration gap

The pith

Semantic entropy fails to regulate gradient variance in LLM post-training because of an anisotropic gap and a calibration gap, which geometry-aware measures and reward-based calibration can close.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that entropy-based uncertainty estimates used to filter model-generated outputs in critic-free post-training of large language models leave two critical shortcomings unaddressed. The anisotropic gap means these estimates overlook the geometric structure of semantic disagreements among responses, while the calibration gap means they do not align with the actual strength of reward signals that drive learning. Through both empirical and theoretical analysis, the authors show these gaps cause unstable optimization dynamics because uncertainty fails to properly control gradient variability. They respond by introducing Geometric-aware Calibrated Policy Optimization, which combines geometry-aware measures to capture semantic disagreement with reward-based calibration. A reader should care because scalable reasoning and alignment improvements hinge on reliable ways to separate informative signals from noise without external critics.

Core claim

The paper establishes that current entropy-based estimators suffer from an anisotropic gap, which prevents them from capturing directional semantic disagreements in response space, and a calibration gap, which misaligns uncertainty estimates with the quality of the learning signal from rewards. Motivated by this analysis, the authors propose Geometric-aware Calibrated Policy Optimization that integrates geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength, resulting in more faithful tracking of gradient variability and consistent performance gains on multiple benchmarks.
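To make the anisotropic gap concrete, here is a minimal sketch, not the paper's measure. It assumes a count-based entropy over semantic cluster labels and uses mean pairwise cosine distance as a stand-in for a geometry-aware quantity. The function names, the toy 2-D embeddings, and the cluster labels are all hypothetical. The point is only that two response groups with the same cluster split get the same entropy even when their clusters sit at very different distances in embedding space.

```python
import math
from collections import Counter
from itertools import combinations


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over cluster labels."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: four responses, split 2/2 into two semantic clusters.
clusters = [0, 0, 1, 1]
# Case A: the two clusters are close in embedding space (minor variation).
near = [(1.0, 0.0), (1.0, 0.0), (0.98, 0.2), (0.98, 0.2)]
# Case B: the two clusters point in opposite directions (strong disagreement).
far = [(1.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]

h = semantic_entropy(clusters)          # identical for both cases
d_near = mean_pairwise_distance(near)   # small
d_far = mean_pairwise_distance(far)     # large

print(f"entropy={h:.4f}  near={d_near:.4f}  far={d_far:.4f}")
```

Under these assumptions the entropy cannot tell case A from case B, while the distance-based quantity can. That is the kind of insensitivity to geometric structure the core claim describes.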

What carries the argument

The Geometric-aware Calibrated Policy Optimization framework, which integrates geometry-aware measures to capture semantic disagreement among responses with reward-based calibration to align uncertainty estimates with learning signal strength.

If this is right

  • Uncertainty signals more faithfully characterize and regulate gradient variability during group-based optimization such as GRPO.
  • Learning signal quality from rewards becomes better reflected in the uncertainty measures applied to model outputs.
  • Post-training performance improves consistently across reasoning and alignment benchmarks by closing the identified gaps.
  • Optimization dynamics gain stability when uncertainty is designed to match the needs of the training process rather than relying on entropy alone.
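As a rough illustration of how an uncertainty signal could regulate a group-based update, the sketch below computes GRPO-style group-relative advantages and then damps them with an uncertainty score. The damping rule `1 - alpha * uncertainty`, the parameter name `alpha`, the assumption that the uncertainty lies in [0, 1], and the use of the population standard deviation are all choices made for this example, not the method described in the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def modulated_advantages(rewards, uncertainty, alpha=0.5):
    """Scale advantages by a hypothetical factor (1 - alpha * uncertainty).

    `uncertainty` is assumed to be normalized to [0, 1].
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    scale = 1.0 - alpha * uncertainty
    return [scale * a for a in group_relative_advantages(rewards)]


# Hypothetical binary rewards for a group of four responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
damped = modulated_advantages(rewards, uncertainty=1.0, alpha=0.5)
print(adv)
print(damped)
```

The design question the list above raises is what to plug in as `uncertainty`: if that score does not track gradient variability, a rule like this damps the wrong prompts.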

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future uncertainty designs for LLM training may need to prioritize geometric structure and reward alignment over information-theoretic entropy measures.
  • The same calibration approach could be tested in related settings where distinguishing signal quality from generated data is required, such as iterative self-improvement loops.
  • Scaling the geometry-aware component to larger models would test whether the capture of semantic disagreement remains effective as response spaces grow more complex.

Load-bearing premise

Geometry-aware measures capture semantic disagreement in a way that regulates gradient variance, and reward-based calibration reliably aligns uncertainty estimates with learning signal quality.

What would settle it

An experiment showing that Geometric-aware Calibrated Policy Optimization (GCPO) produces no reduction in gradient variance or no performance gains over entropy-based methods on standard post-training benchmarks would disprove the central claim.
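One way such a check could be set up, sketched under assumptions: measure a per-prompt uncertainty score and a per-prompt gradient variance, then compare how well each uncertainty signal ranks prompts by gradient variance using Spearman rank correlation. The numbers below are made-up toy values used only to exercise the code; they are not results from the paper, and the variable names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values (NOT from the paper): gradient variance for four prompts,
# plus two hypothetical uncertainty signals scored on the same prompts.
grad_var = [0.2, 0.3, 0.7, 1.5]
signal_a = [0.1, 0.4, 0.5, 0.9]   # orders prompts the same way as grad_var
signal_b = [0.9, 0.1, 0.5, 0.4]   # orders prompts poorly

rho_a = spearman(signal_a, grad_var)
rho_b = spearman(signal_b, grad_var)
print(rho_a, rho_b)
```

A signal that tracks gradient variability should score close to 1 on real measurements; a result near zero for GCPO's signal would be evidence against the tracking claim.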

Figures

Figures reproduced from arXiv: 2605.21801 by Han Bao, Kaiwen Shi, Tianyi Ma, Yanfang Ye, Zehong Wang, Zheyuan Zhang.

Figure 1: Statistical analysis of uncertainty vs. gradient variance. Top row: NarrativeQA; bottom …
Figure 2: The Two Core Gaps: An Intuitive View and Demonstration. Details in Section 2. Intuitively, entropy-based methods treat all dispersion as noise, while in RL optimization, dispersion often encodes learning opportunity. High-entropy inputs frequently correspond to ambiguous or partially solved problems where reward signals differentiate competing reasoning paths. By ignoring this alignment, H_sem conflates un…
Figure 3: Illustration of the key components of GCPO. The formulas above highlight the core principles rather than the complete method; complete details are provided in Section 3. This formulation provides a direct geometric correction to entropy-based uncertainty by penalizing large semantic deviations while remaining tolerant to minor variations. In practice, CD is particularly effective when semantic variati…
Figure 4: Effect of α and RD on NarrativeQA. We further evaluate GCPO and baseline methods on mathematical reasoning benchmarks, where we consider the full reasoning trajectories rather than only the final boxed answers. As shown in …
Figure 5: Effect of α on Qasper. We conduct ablations on the scaling factor α, reward dispersion (RD), and model capacity. Moderate α values (e.g., α = 0.6) perform best, while too small or large values degrade performance, indicating that geometry-aware signals are most effective as calibrated modulation. RD-only remains competitive but consistently underperforms CD and BoT, suggesting that reward variation alon…
Original abstract

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and it is unclear how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that entropy-based uncertainty estimators used to regulate group-based policy optimization (e.g., GRPO) in critic-free LLM post-training suffer from an anisotropic gap (failure to capture directional semantic disagreement in embedding space) and a calibration gap (misalignment between uncertainty and learning-signal strength). It presents empirical and theoretical analysis identifying these gaps, then introduces Geometric-aware Calibrated Policy Optimization (GCPO) that combines geometry-aware uncertainty measures with reward-based calibration to better track gradient variance and improve optimization dynamics. Experiments on multiple benchmarks reportedly show that GCPO more faithfully tracks gradient variability and yields consistent post-training gains.

Significance. If the gap analysis and the claimed improvements hold, the work supplies a principled lens for designing uncertainty signals that are explicitly aligned with optimization dynamics rather than treated as black-box regularizers. This perspective could inform more stable post-training pipelines for reasoning and alignment tasks. The manuscript does not report machine-checked proofs or fully parameter-free derivations, but the emphasis on linking uncertainty geometry to gradient regulation is a constructive contribution if the supporting evidence is strengthened.

major comments (3)
  1. [§3, Gap Analysis] The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.
  2. [§4.2, Reward-based Calibration] The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.
  3. [Experimental section, gradient-variance plots] The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.
minor comments (2)
  1. [§4.1] Notation for the geometry-aware measure is introduced without an explicit equation reference in the main text; a numbered definition would improve readability.
  2. [Related Work] The abstract states 'to our knowledge, the first principled formulation,' but the related-work section does not explicitly contrast the new formulation against prior uses of embedding geometry in uncertainty estimation for RLHF or preference optimization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our gap analysis and experimental validation. We respond to each major comment below and commit to revisions that address the identified issues while preserving the core contributions of the work.

  1. Referee: [§3, Gap Analysis] The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.

    Authors: We agree that the current manuscript motivates the anisotropic gap via directional variance but stops short of a formal definition or bound. In the revision we will add a precise definition of the geometry-aware uncertainty as the trace of the projected covariance matrix onto the leading principal directions of the response embeddings, together with a lemma bounding the reduction in expected gradient variance relative to isotropic entropy under a Lipschitz assumption on the reward function. This will make the tracking claim derivable rather than interpretive. revision: yes
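    The definition promised in this response can be written out as a sketch. The notation is an assumption introduced here, not taken from the paper: $e_1, \dots, e_G$ are the embeddings of the $G$ responses in a group, $\Sigma$ is their empirical covariance, and $V_k$ stacks the $k$ leading eigenvectors of $\Sigma$ as columns.

```latex
\bar{e} = \frac{1}{G}\sum_{i=1}^{G} e_i,
\qquad
\Sigma = \frac{1}{G}\sum_{i=1}^{G} (e_i - \bar{e})(e_i - \bar{e})^{\top},
\qquad
U_{\mathrm{geo}} = \operatorname{tr}\!\left(V_k^{\top}\,\Sigma\,V_k\right) = \sum_{j=1}^{k} \lambda_j ,
```

    where $\lambda_1 \ge \dots \ge \lambda_k$ are the corresponding eigenvalues. With $k$ equal to the embedding dimension this reduces to the total variance $\operatorname{tr}(\Sigma)$; a smaller $k$ keeps only the dominant directions of disagreement. The variance-reduction lemma and its Lipschitz assumption are not reproduced here, since the response only announces them.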

  2. Referee: [§4.2, Reward-based Calibration] The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.

    Authors: The concern is valid in principle. However, the calibration procedure uses a lagged, exponentially-smoothed reward buffer computed from a frozen reference policy rather than the live policy gradients; the uncertainty scalar is therefore fitted to historical signal strength and does not directly modulate the current gradient direction. We will revise §4.2 to include an explicit information-flow diagram and a short proof sketch showing that the calibration operator is contractive with respect to the policy-update operator, thereby removing the circularity. revision: yes
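    A minimal sketch of what a lagged, exponentially smoothed buffer could look like, assuming one scalar reward-signal summary per training step. The class name `LaggedRewardBuffer`, the `decay` parameter, and the choice to return `None` before any history exists are hypothetical. How the lagged value then enters the calibration, and how the frozen reference policy produces the rewards, are not specified in the response and are left out.

```python
class LaggedRewardBuffer:
    """Exponential moving average of a reward-signal summary, read with a one-step lag."""

    def __init__(self, decay=0.9):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must lie in [0, 1)")
        self.decay = decay
        self.value = None  # no history yet

    def step(self, current_signal):
        """Return the smoothed value from BEFORE this step, then fold in the new signal.

        Reading before updating is what makes the value lagged: the number used
        for calibration at step t depends only on signals from steps < t.
        """
        lagged = self.value
        if self.value is None:
            self.value = current_signal
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * current_signal
        return lagged


# Hypothetical per-step reward-signal summaries (for example, reward dispersion).
buffer = LaggedRewardBuffer(decay=0.5)
first = buffer.step(1.0)   # no history yet
second = buffer.step(3.0)  # sees only the first signal
third = buffer.step(0.0)   # sees the smoothed first two signals
print(first, second, third)
```

    In this toy version the value used at each step never includes the current step's signal, which is the separation the authors appeal to when answering the circularity concern.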

  3. Referee: [Experimental section, gradient-variance plots] The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.

    Authors: We concur that component-wise ablations are necessary. The revised experimental section will report three additional curves on the gradient-variance plots: geometry-aware uncertainty alone, reward calibration alone, and the combined GCPO. Corresponding tables will quantify the marginal contribution of each module to both variance tracking and downstream benchmark gains, directly addressing the attribution question. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained


The paper first performs empirical and theoretical analysis to identify the anisotropic and calibration gaps in existing entropy-based uncertainty estimators. It then motivates GCPO as a framework that integrates geometry-aware measures and reward-based calibration to address those gaps. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the alignment of uncertainty with learning signal strength is presented as an independent design choice motivated by the prior gap analysis rather than being tautological with the optimization loop itself. The central claims therefore retain independent content from the identified gaps and do not rely on renaming or smuggling prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about how uncertainty should relate to optimization dynamics and on the unproven premise that the identified gaps are the primary failure modes of semantic entropy.

axioms (1)
  • Domain assumption: Uncertainty signals can be interpreted as mechanisms for characterizing and regulating gradient variance and learning signal quality.
    Invoked in the opening motivation for the principled formulation.



Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 9 internal anchors

  1. [1]

    Llms4all: A review of large language models across academic disciplines.arXiv preprint arXiv:2509.19580, 2025

    Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, et al. Llms4all: A review of large language models across academic disciplines.arXiv preprint arXiv:2509.19580, 2025

  2. [2]

    Dart-math: Difficulty- aware rejection tuning for mathematical problem-solving.Advances in Neural Information Processing Systems, 37:7821–7846, 2024

    Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty- aware rejection tuning for mathematical problem-solving.Advances in Neural Information Processing Systems, 37:7821–7846, 2024

  3. [3]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  4. [4]

    Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications

    Chaoran Chen, Daodao Zhou, Yanfang Ye, Toby Jia-jun Li, and Yaxing Yao. Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications. InProceedings of the 30th International Conference on Intelligent User Interfaces, pages 277–297, 2025

  5. [5]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  6. [6]

    Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in llm agents.arXiv preprint arXiv:2601.22311, 2026

    Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, et al. Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in llm agents.arXiv preprint arXiv:2601.22311, 2026

  7. [7]

    Graph is a substrate across data modalities.arXiv preprint arXiv:2601.22384, 2026

    Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities.arXiv preprint arXiv:2601.22384, 2026

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  9. [9]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  10. [10]

    Non-monotonic autoregressive sequence model

    Tianyi Ma, Yiyue Qian, Yiyang Li, Zehong Wang, Yifang Ding, Zheyuan Zhang, Yan Liang, Chuxu Zhang, and Yanfang Ye. Non-monotonic autoregressive sequence model. InICML, 2026

  11. [11]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  12. [12]

    Agentic Reinforced Policy Optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025

  13. [13]

    Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545, 2025

    Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jing- han Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, et al. Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545, 2025

  14. [14]

    Lm-polygraph: Uncertainty estimation for language models

    Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. Lm-polygraph: Uncertainty estimation for language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages ...

  15. [15]

    Fact-checking the output of large language models via token-level uncertainty quantification

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. Fact-checking the output of large language models via token-level uncertainty quantification. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9367–9...

  16. [16]

    Gft: Graph foundation model with transferable tree vocabulary.Advances in neural information processing systems, 37:107403–107443, 2024

    Zehong Wang, Zheyuan Zhang, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. Gft: Graph foundation model with transferable tree vocabulary.Advances in neural information processing systems, 37:107403–107443, 2024

  17. [17]

    arXiv preprint arXiv:2505.12346 , year=

    Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346, 2025

  18. [18]

    Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

  19. [19]

    Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making.arXiv preprint arXiv:2506.17419, 2025

    Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making.arXiv preprint arXiv:2506.17419, 2025

  20. [20]

    Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities.Advances in Neural Information Processing Systems, 37:8901–8929, 2024

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities.Advances in Neural Information Processing Systems, 37:8901–8929, 2024

  21. [21]

    Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

    Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

  22. [22]

    Scenario-independent uncertainty estimation for llm-based question answering via factor analysis

    Zhihua Wen, Zhizhao Liu, Zhiliang Tian, Shilong Pan, Zhen Huang, Dongsheng Li, and Minlie Huang. Scenario-independent uncertainty estimation for llm-based question answering via factor analysis. InProceedings of the ACM on Web Conference 2025, pages 2378–2390, 2025

  23. [23]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InICLR, 2023

  24. [24]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, second edition, 2018

  25. [25]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

  26. [26]

    Signal-to-noise ratio analysis of policy gradient algorithms

    John Roberts and Russ Tedrake. Signal-to-noise ratio analysis of policy gradient algorithms. NeurIPS, 2008

  27. [27]

    Chapman and Hall/CRC, 2024

    George Casella and Roger Berger.Statistical inference. Chapman and Hall/CRC, 2024

  28. [28]

    The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, 6:317–328, 2018

    Tomáš Koˇcisk`y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, 6:317–328, 2018

  29. [29]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021

  30. [30]

    Cohen, Ruslan Salakhut- dinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhut- dinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InEMNLP, 2018

  31. [31]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 11

  32. [32]

    Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai.Advances in Neural Information Processing Systems, 37:19209–19253, 2024

    Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai.Advances in Neural Information Processing Systems, 37:19209–19253, 2024

  33. [33]

    Deliberate reasoning in language models as structure-aware planning with an accurate world model

    Siheng Xiong, Ali Payani, Yu’an Yang, and Faramarz Fekri. Deliberate reasoning in language models as structure-aware planning with an accurate world model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31900–31931, 2025

  34. [34]

    The effects of in-domain corpus size on pre-training bert

    Chris Sanchez and Zheyuan Zhang. The effects of in-domain corpus size on pre-training bert. arXiv preprint arXiv:2212.07914, 2022

  35. [35]

    Enhancing language model reasoning with structured multi-level modeling

    Siheng Xiong, Ali Payani, and Faramarz Fekri. Enhancing language model reasoning with structured multi-level modeling. InThe Fourteenth International Conference on Learning Representations, 2025

  36. [36]

    Scaling search-augmented llm reasoning via adaptive information control.arXiv preprint arXiv:2602.01672, 2026

    Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C Kerce, and Faramarz Fekri. Scaling search-augmented llm reasoning via adaptive information control.arXiv preprint arXiv:2602.01672, 2026

  37. [37]

    Cheffusion: Multimodal foundation model integrating recipe and food image generation

    Peiyu Li, Xiaobao Huang, Yijun Tian, and Nitesh V Chawla. Cheffusion: Multimodal foundation model integrating recipe and food image generation. InCIKM, 2024

  38. [38]

    Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks.arXiv, 2025

    Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, and Nitesh V Chawla. Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks.arXiv, 2025

  39. [39]

    Crochetbench: Can vision-language models move from describing to doing in crochet domain?arXiv, 2025

    Peiyu Li, Xiaobao Huang, Ting Hua, and Nitesh V Chawla. Crochetbench: Can vision-language models move from describing to doing in crochet domain?arXiv, 2025

  40. [40]

    Mapro: Recasting multi-agent prompt optimization as maximum a posteriori inference

    Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, and Yanfang Ye. Mapro: Recasting multi-agent prompt optimization as maximum a posteriori inference. InFindings of the Association for Computational Linguistics: EACL 2026, pages 4458–4480, 2026

  41. [41]

    Agentrouter: A knowledge-graph-guided llm router for collaborative multi-agent question answering.arXiv preprint arXiv:2510.05445, 2025

    Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Muruge- san, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Agentrouter: A knowledge-graph-guided llm router for collaborative multi-agent question answering.arXiv preprint arXiv:2510.05445, 2025

  42. [42]

    Ng-router: Graph-supervised multi-agent collaboration for nutrition question answering

    Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Ng-router: Graph-supervised multi-agent collaboration for nutrition question answering. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7508–7527, 2026

  43. [43]

    EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

    Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, and Chuxu Zhang. Evolver- outer: Co-evolving routing and prompt for multi-agent question answering.arXiv preprint arXiv:2604.05149, 2026

  44. [44]

    Glen-bench: A graph-language based benchmark for nutritional health.arXiv preprint arXiv:2601.18106, 2026

    Jiatan Huang, Zheyuan Zhang, Tianyi Ma, Mingchen Li, Yaning Zheng, Yanfang Ye, and Chuxu Zhang. Glen-bench: A graph-language based benchmark for nutritional health.arXiv preprint arXiv:2601.18106, 2026

  45. [45]

    Drift-bench: Diagnosing cooperative breakdowns in llm agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

    Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, and Yanfang Ye. Drift-bench: Diagnosing cooperative breakdowns in llm agents under input faults via multi-turn interaction.arXiv preprint arXiv:2602.02455, 2026

  46. [46]

    Entropy-gated selective policy optimization: Token-level gradient allocation for hybrid training of large language models

    Yuelin Hu, Zhengxue Cheng, Wei Liu, and Li Song. Entropy-gated selective policy optimization: Token-level gradient allocation for hybrid training of large language models. arXiv preprint arXiv:2602.03309, 2026

  47. [47]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023

  48. [48]

    SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

    Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896. URL https://api.semanticscholar.org/CorpusID:257557820

  50. [50]

    Luq: Long-text uncertainty quantification for llms

    Caiqi Zhang, Fangyu Liu, Marco Basaldella, and Nigel Collier. Luq: Long-text uncertainty quantification for llms. ArXiv, abs/2403.20279, 2024. URL https://api.semanticscholar.org/CorpusID:268793903

  51. [51]

    Graph-based uncertainty metrics for long-form language model outputs

    Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, and Tatsunori B. Hashimoto. Graph-based uncertainty metrics for long-form language model outputs. ArXiv, abs/2410.20783. URL https://api.semanticscholar.org/CorpusID:273654396

  53. [53]

    Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024

  54. [54]

    Gvpo: Group variance policy optimization for large language model post-training

    Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training. arXiv preprint arXiv:2504.19599, 2025

  55. [55]

    A survey of weight space learning: Understanding, representation, and generation

    Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, et al. A survey of weight space learning: Understanding, representation, and generation. arXiv preprint arXiv:2603.10090, 2026

  56. [56]

    On the plasticity and stability for post-training large language models

    Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, and Hui Xiong. On the plasticity and stability for post-training large language models. arXiv preprint arXiv:2602.06453, 2026

  57. [57]

    Rence: Learning to reason by noise contrastive estimation

    Wenzheng Zhang and Karl Stratos. Rence: Learning to reason by noise contrastive estimation. arXiv preprint arXiv:2601.22432, 2026

  58. [58]

    Q-hawkeye: Reliable visual policy optimization for image quality assessment

    Wulin Xie, Rui Dai, Ruidong Ding, Kaikui Liu, Xiangxiang Chu, Xinwen Hou, and Jie Wen. Q-hawkeye: Reliable visual policy optimization for image quality assessment. arXiv preprint arXiv:2601.22920, 2026

  59. [59]

    Ratio-variance regularized policy optimization for efficient llm fine-tuning

    Yu Luo, Shuo Han, Yihan Hu, Dong Li, and Jianye Hao. Ratio-variance regularized policy optimization for efficient llm fine-tuning. arXiv preprint arXiv:2601.03320, 2026

  60. [60]

    Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting

    Kangda Wei and Ruihong Huang. Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting. arXiv preprint arXiv:2601.09085, 2026

  61. [61]

    Can llms guide their own exploration? gradient-guided reinforcement learning for llm reasoning

    Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, and Dong Yu. Can llms guide their own exploration? gradient-guided reinforcement learning for llm reasoning. arXiv preprint arXiv:2512.15687, 2025

  62. [62]

    Grpo-lambda: Credit assignment improves llm reasoning

    Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, and Sarath Chandar. Grpo-lambda: Credit assignment improves llm reasoning. arXiv preprint arXiv:2510.00194, 2025

A Related Work

A.1 Uncertainty Estimation for Generation and Reasoning.

Large language models (LLMs) have advanced rapidly in recent years [1, 33–39]. Building on this progress...