Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

Han Bao; Kaiwen Shi; Tianyi Ma; Yanfang Ye; Zehong Wang; Zheyuan Zhang

arxiv: 2605.21801 · v1 · pith:NGLFOP6Nnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

Pith reviewed 2026-05-22 08:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords semantic entropy · policy optimization · uncertainty estimation · large language models · post-training · gradient variance · calibration gap

The pith

Semantic entropy fails to regulate gradient variance in LLM post-training because of an anisotropic gap and a calibration gap, which geometry-aware measures and reward-based calibration can close.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that entropy-based uncertainty estimates used to filter model-generated outputs in critic-free post-training of large language models leave two critical shortcomings unaddressed. The anisotropic gap means these estimates overlook the geometric structure of semantic disagreements among responses, while the calibration gap means they do not align with the actual strength of reward signals that drive learning. Through both empirical and theoretical analysis, the authors show these gaps cause unstable optimization dynamics because uncertainty fails to properly control gradient variability. They respond by introducing Geometric-aware Calibrated Policy Optimization, which combines geometry-aware measures to capture semantic disagreement with reward-based calibration. A reader should care because scalable reasoning and alignment improvements hinge on reliable ways to separate informative signals from noise without external critics.

Core claim

The paper establishes that current entropy-based estimators suffer from an anisotropic gap, which prevents them from capturing directional semantic disagreements in response space, and a calibration gap, which misaligns uncertainty estimates with the quality of the learning signal from rewards. Motivated by this analysis, the authors propose Geometric-aware Calibrated Policy Optimization that integrates geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength, resulting in more faithful tracking of gradient variability and consistent performance gains on multiple benchmarks.

What carries the argument

The Geometric-aware Calibrated Policy Optimization framework, which integrates geometry-aware measures to capture semantic disagreement among responses with reward-based calibration to align uncertainty estimates with learning signal strength.
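The anisotropic gap can be illustrated with a small sketch. It assumes each response has a semantic cluster label and a 2-D embedding, and it uses a hypothetical dispersion score (mean of one minus cosine similarity to the mean embedding) as a stand-in for a geometry-aware measure. The function names and the toy embeddings are assumptions made for illustration; the paper's own definitions are in its Section 3.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Shannon entropy (nats) of the empirical distribution over semantic clusters."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def cosine_dispersion(embeddings):
    """Hypothetical dispersion: mean of (1 - cosine similarity) to the mean embedding.

    Assumes the mean embedding is non-zero.
    """
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    return sum(1.0 - cos(e, mean) for e in embeddings) / len(embeddings)

# Two groups of four responses, each split 2/2 across two semantic clusters.
clusters = [0, 0, 1, 1]
near = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.9, 0.1]]  # clusters almost aligned
far = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # clusters orthogonal

h = semantic_entropy(clusters)   # identical for both groups
cd_near = cosine_dispersion(near)
cd_far = cosine_dispersion(far)  # larger than cd_near
```

Both groups have the same cluster entropy because entropy only sees the 2/2 split, while the dispersion score separates a minor paraphrase-level disagreement from a large semantic one.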

If this is right

Uncertainty signals more faithfully characterize and regulate gradient variability during group-based optimization such as GRPO.
Learning signal quality from rewards becomes better reflected in the uncertainty measures applied to model outputs.
Post-training performance improves consistently across reasoning and alignment benchmarks by closing the identified gaps.
Optimization dynamics gain stability when uncertainty is designed to match the needs of the training process rather than relying on entropy alone.
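The calibration point can be made concrete with a minimal sketch of GRPO-style group-relative advantages. The advantage normalization follows the usual group-based form; `reward_dispersion` and `update_weight` are hypothetical helpers introduced here for illustration, and the modulation rule, the `rd_max` constant, and the assumption that uncertainty lies in [0, 1] are not taken from the paper. The default `alpha=0.6` echoes the value mentioned in the Figure 5 caption.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def reward_dispersion(rewards):
    """Hypothetical RD proxy: population standard deviation of the group's rewards."""
    return statistics.pstdev(rewards)

def update_weight(uncertainty, rd, alpha=0.6, rd_max=0.5):
    """Hypothetical rule: damp the update only when uncertainty is high
    and the rewards are uninformative.

    uncertainty is assumed to lie in [0, 1]; rd_max=0.5 is the largest
    population std reachable with rewards in [0, 1].
    """
    informativeness = min(1.0, rd / rd_max)
    return 1.0 - alpha * uncertainty * (1.0 - informativeness)

split = [1.0, 0.0, 1.0, 0.0]  # rewards separate competing answers
flat = [1.0, 1.0, 1.0, 1.0]   # rewards carry no preference

adv_split = group_advantages(split)
adv_flat = group_advantages(flat)  # all zeros: no learning signal
w_split = update_weight(uncertainty=1.0, rd=reward_dispersion(split))
w_flat = update_weight(uncertainty=1.0, rd=reward_dispersion(flat))
```

With equal uncertainty, the group whose rewards discriminate between responses keeps its full update weight, while the group with flat rewards is damped. This is one way to read the claim that dispersion can encode learning opportunity rather than noise.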

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future uncertainty designs for LLM training may need to prioritize geometric structure and reward alignment over information-theoretic entropy measures.
The same calibration approach could be tested in related settings where distinguishing signal quality from generated data is required, such as iterative self-improvement loops.
Scaling the geometry-aware component to larger models would test whether the capture of semantic disagreement remains effective as response spaces grow more complex.

Load-bearing premise

Geometry-aware measures capture semantic disagreement in a way that regulates gradient variance, and reward-based calibration reliably aligns uncertainty estimates with learning signal quality.

What would settle it

An experiment showing that GCPO produces no reduction in gradient variance or no performance gains over entropy-based methods on standard post-training benchmarks would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.21801 by Han Bao, Kaiwen Shi, Tianyi Ma, Yanfang Ye, Zehong Wang, Zheyuan Zhang.

**Figure 2.** The Two Core Gaps: An Intuitive View and Demonstration. Details in Section 2. Intuitively, entropy-based methods treat all dispersion as noise, while in RL optimization, dispersion often encodes learning opportunity. High-entropy inputs frequently correspond to ambiguous or partially solved problems where reward signals differentiate competing reasoning paths. By ignoring this alignment, $H_{\mathrm{sem}}$ conflates un…

**Figure 3.** Illustration of the key components of GCPO. The formulas above highlight the core principles rather than the complete method; complete details are provided in Section 3. This formulation provides a direct geometric correction to entropy-based uncertainty by penalizing large semantic deviations while remaining tolerant to minor variations. In practice, CD is particularly effective when semantic variati…

**Figure 4.** Effect of α and RD on NarrativeQA. We further evaluate GCPO and baseline methods on mathematical reasoning benchmarks, where we consider the full reasoning trajectories rather than only the final boxed answers. As shown in …

**Figure 5.** Effect of α on Qasper. We conduct ablations on the scaling factor α, reward dispersion (RD), and model capacity. Moderate α values (e.g., α = 0.6) perform best, while too small or too large values degrade performance, indicating that geometry-aware signals are most effective as calibrated modulation. RD-only remains competitive but consistently underperforms CD and BoT, suggesting that reward variation alon…

Original abstract

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable and unclear in how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper flags real issues with semantic entropy for LLM post-training and offers GCPO as a fix via geometry and reward calibration, but the supporting analysis stays mostly high-level.

The main point is that semantic entropy struggles with anisotropic effects and poor calibration when used to steer group-based optimization like GRPO in critic-free LLM training. The authors position GCPO as a direct response that uses geometry-aware measures to better reflect semantic disagreement and reward calibration to tie uncertainty estimates to actual learning signal strength. They claim this leads to more faithful tracking of gradient variability and steadier post-training gains on benchmarks.

Referee Report

3 major / 2 minor

Summary. The paper claims that entropy-based uncertainty estimators used to regulate group-based policy optimization (e.g., GRPO) in critic-free LLM post-training suffer from an anisotropic gap (failure to capture directional semantic disagreement in embedding space) and a calibration gap (misalignment between uncertainty and learning-signal strength). It presents empirical and theoretical analysis identifying these gaps, then introduces Geometric-aware Calibrated Policy Optimization (GCPO) that combines geometry-aware uncertainty measures with reward-based calibration to better track gradient variance and improve optimization dynamics. Experiments on multiple benchmarks reportedly show that GCPO more faithfully tracks gradient variability and yields consistent post-training gains.

Significance. If the gap analysis and the claimed improvements hold, the work supplies a principled lens for designing uncertainty signals that are explicitly aligned with optimization dynamics rather than treated as black-box regularizers. This perspective could inform more stable post-training pipelines for reasoning and alignment tasks. The manuscript does not report machine-checked proofs or fully parameter-free derivations, but the emphasis on linking uncertainty geometry to gradient regulation is a constructive contribution if the supporting evidence is strengthened.

major comments (3)

[§3] Gap Analysis: The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.
[§4.2] Reward-based Calibration: The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.
[Experimental section] Gradient-variance plots: The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.

minor comments (2)

[§4.1] Notation for the geometry-aware measure is introduced without an explicit equation reference in the main text; a numbered definition would improve readability.
[Related Work] The abstract states 'to our knowledge, the first principled formulation,' but the related-work section does not explicitly contrast the new formulation against prior uses of embedding geometry in uncertainty estimation for RLHF or preference optimization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our gap analysis and experimental validation. We respond to each major comment below and commit to revisions that address the identified issues while preserving the core contributions of the work.

Referee: [§3] Gap Analysis: The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.

Authors: We agree that the current manuscript motivates the anisotropic gap via directional variance but stops short of a formal definition or bound. In the revision we will add a precise definition of the geometry-aware uncertainty as the trace of the projected covariance matrix onto the leading principal directions of the response embeddings, together with a lemma bounding the reduction in expected gradient variance relative to isotropic entropy under a Lipschitz assumption on the reward function. This will make the tracking claim derivable rather than interpretive. (revision: yes)

Referee: [§4.2] Reward-based Calibration: The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.

Authors: The concern is valid in principle. However, the calibration procedure uses a lagged, exponentially-smoothed reward buffer computed from a frozen reference policy rather than the live policy gradients; the uncertainty scalar is therefore fitted to historical signal strength and does not directly modulate the current gradient direction. We will revise §4.2 to include an explicit information-flow diagram and a short proof sketch showing that the calibration operator is contractive with respect to the policy-update operator, thereby removing the circularity. (revision: yes)

Referee: [Experimental section] Gradient-variance plots: The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.

Authors: We concur that component-wise ablations are necessary. The revised experimental section will report three additional curves on the gradient-variance plots: geometry-aware uncertainty alone, reward calibration alone, and the combined GCPO. Corresponding tables will quantify the marginal contribution of each module to both variance tracking and downstream benchmark gains, directly addressing the attribution question. (revision: yes)
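The simulated response to the §4.2 comment relies on a lagged, exponentially smoothed reward buffer. The sketch below shows one way such a buffer could behave; the class name, the `beta` parameter, and the use of reward standard deviation as the smoothed quantity are all assumptions made for illustration, and the source of the rewards (a frozen reference policy in the response) is left outside the sketch.

```python
import statistics

class LaggedRewardDispersion:
    """Exponentially smoothed reward spread, read with a one-step lag (illustrative)."""

    def __init__(self, beta=0.9):
        self.beta = beta   # smoothing factor; larger means slower adaptation
        self.value = None  # no history before the first group is observed

    def step(self, rewards):
        """Return the estimate built from earlier groups only, then fold in this group."""
        lagged = self.value
        spread = statistics.pstdev(rewards)
        if self.value is None:
            self.value = spread
        else:
            self.value = self.beta * self.value + (1.0 - self.beta) * spread
        return lagged

tracker = LaggedRewardDispersion(beta=0.5)
first = tracker.step([1.0, 0.0, 1.0, 0.0])   # no history yet, so None
second = tracker.step([1.0, 1.0, 1.0, 1.0])  # sees only the first group
third = tracker.step([0.0, 0.0, 0.0, 0.0])   # sees the smoothed first two groups
```

Because `step` returns the value computed before the current group is folded in, the calibration signal at a given step never depends on the rewards that drive that step's gradient, which is the separation the response appeals to.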

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper first performs empirical and theoretical analysis to identify the anisotropic and calibration gaps in existing entropy-based uncertainty estimators. It then motivates GCPO as a framework that integrates geometry-aware measures and reward-based calibration to address those gaps. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the alignment of uncertainty with learning signal strength is presented as an independent design choice motivated by the prior gap analysis rather than being tautological with the optimization loop itself. The central claims therefore retain independent content from the identified gaps and do not rely on renaming or smuggling prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about how uncertainty should relate to optimization dynamics and on the unproven premise that the identified gaps are the primary failure modes of semantic entropy.

axioms (1)

domain assumption: Uncertainty signals can be interpreted as mechanisms for characterizing and regulating gradient variance and learning signal quality.
Invoked in the opening motivation for the principled formulation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "we introduce geometry-aware measures, including Cosine Dispersion (CD) and Barycentric Transport (BoT), to capture semantic disagreement beyond entropy, and further incorporate a Reward Dispersion (RD) module to align update strength with reward informativeness."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: $V(x) = \sum_k p_k \,\mathrm{Tr}(\mathrm{Cov}(g \mid Z = k)) + \mathrm{Tr}(\mathrm{Cov}(\mu_Z))$ (intra- vs inter-cluster gradient variance)
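
The decomposition in the quoted passage has the form of the law of total variance applied to the trace of a covariance. The sketch below checks that identity numerically on toy data, reading g as a per-response gradient vector, Z as its semantic cluster, p_k as the cluster frequency, and μ_Z as the cluster-mean gradient. That reading of the symbols and the toy vectors are assumptions; the page does not define them.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def trace_cov(vectors):
    """Trace of the population covariance: mean squared distance to the centroid."""
    c = centroid(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors) / len(vectors)

# Toy per-response gradients grouped by semantic cluster Z.
grads = {
    0: [[1.0, 0.0], [1.2, 0.2], [0.8, -0.2]],
    1: [[-1.0, 0.5], [-1.0, 1.5]],
}
all_grads = [g for group in grads.values() for g in group]
p = {k: len(group) / len(all_grads) for k, group in grads.items()}

# Intra-cluster term: sum_k p_k Tr(Cov(g | Z = k)).
within = sum(p[k] * trace_cov(group) for k, group in grads.items())

# Inter-cluster term: Tr(Cov(mu_Z)), the p_k-weighted spread of cluster means.
grand = centroid(all_grads)
between = sum(
    p[k] * sum((m - c) ** 2 for m, c in zip(centroid(group), grand))
    for k, group in grads.items()
)

total = trace_cov(all_grads)  # equals within + between up to rounding
```

The identity holds exactly for population (not sample) covariances with p_k set to the empirical cluster frequencies, which is why `trace_cov` divides by n rather than n - 1.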

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 9 internal anchors

[1]

Llms4all: A review of large language models across academic disciplines.arXiv preprint arXiv:2509.19580, 2025

Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, et al. Llms4all: A review of large language models across academic disciplines.arXiv preprint arXiv:2509.19580, 2025

work page arXiv 2025
[2]

Dart-math: Difficulty- aware rejection tuning for mathematical problem-solving.Advances in Neural Information Processing Systems, 37:7821–7846, 2024

Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty- aware rejection tuning for mathematical problem-solving.Advances in Neural Information Processing Systems, 37:7821–7846, 2024

work page 2024
[3]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[4]

Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications

Chaoran Chen, Daodao Zhou, Yanfang Ye, Toby Jia-jun Li, and Yaxing Yao. Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications. InProceedings of the 30th International Conference on Intelligent User Interfaces, pages 277–297, 2025

work page 2025
[5]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in llm agents.arXiv preprint arXiv:2601.22311, 2026

Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, et al. Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in llm agents.arXiv preprint arXiv:2601.22311, 2026

work page arXiv 2026
[7]

Graph is a substrate across data modalities.arXiv preprint arXiv:2601.22384, 2026

Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities.arXiv preprint arXiv:2601.22384, 2026

work page arXiv 2026
[8]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Non-monotonic autoregressive sequence model

Tianyi Ma, Yiyue Qian, Yiyang Li, Zehong Wang, Yifang Ding, Zheyuan Zhang, Yan Liang, Chuxu Zhang, and Yanfang Ye. Non-monotonic autoregressive sequence model. InICML, 2026

work page 2026
[11]

Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

work page 2024
[12]

Agentic Reinforced Policy Optimization

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545, 2025

Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jing- han Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, et al. Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545, 2025

work page arXiv 2025
[14]

Lm-polygraph: Uncertainty estimation for language models

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. Lm-polygraph: Uncertainty estimation for language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages ...

work page 2023
[15]

Fact-checking the output of large language models via token-level uncertainty quantification

Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. Fact-checking the output of large language models via token-level uncertainty quantification. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9367–9...

work page 2024
[16]

Gft: Graph foundation model with transferable tree vocabulary.Advances in neural information processing systems, 37:107403–107443, 2024

Zehong Wang, Zheyuan Zhang, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. Gft: Graph foundation model with transferable tree vocabulary.Advances in neural information processing systems, 37:107403–107443, 2024

work page 2024
[17]

arXiv preprint arXiv:2505.12346 , year=

Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346, 2025

work page arXiv 2025
[18]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

work page arXiv 2025
[19]

Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making.arXiv preprint arXiv:2506.17419, 2025

Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making.arXiv preprint arXiv:2506.17419, 2025

work page arXiv 2025
[20]

Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities.Advances in Neural Information Processing Systems, 37:8901–8929, 2024

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities.Advances in Neural Information Processing Systems, 37:8901–8929, 2024

work page 2024
[21]

Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo- care: Consistency-aware reinforcement learning for multimodal reasoning.arXiv preprint arXiv:2506.16141, 2025

work page arXiv 2025
[22]

Scenario-independent uncertainty estimation for llm-based question answering via factor analysis

Zhihua Wen, Zhizhao Liu, Zhiliang Tian, Shilong Pan, Zhen Huang, Dongsheng Li, and Minlie Huang. Scenario-independent uncertainty estimation for llm-based question answering via factor analysis. InProceedings of the ACM on Web Conference 2025, pages 2378–2390, 2025

work page 2025
[23]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InICLR, 2023

work page 2023
[24]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, second edition, 2018

work page 2018
[25]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Signal-to-noise ratio analysis of policy gradient algorithms

John Roberts and Russ Tedrake. Signal-to-noise ratio analysis of policy gradient algorithms. NeurIPS, 2008

work page 2008
[27]

Chapman and Hall/CRC, 2024

George Casella and Roger Berger.Statistical inference. Chapman and Hall/CRC, 2024

work page 2024
[28]

The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, 6:317–328, 2018

Tomáš Koˇcisk`y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge.Transac- tions of the Association for Computational Linguistics, 6:317–328, 2018

work page 2018
[29]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021

work page 2021
[30]

Cohen, Ruslan Salakhut- dinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhut- dinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InEMNLP, 2018

work page 2018
[31]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[32]

Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai.Advances in Neural Information Processing Systems, 37:19209–19253, 2024

Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai.Advances in Neural Information Processing Systems, 37:19209–19253, 2024

work page 2024
[33]

Deliberate reasoning in language models as structure-aware planning with an accurate world model

Siheng Xiong, Ali Payani, Yu’an Yang, and Faramarz Fekri. Deliberate reasoning in language models as structure-aware planning with an accurate world model. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31900–31931, 2025

work page 2025
[34]

The effects of in-domain corpus size on pre-training bert

Chris Sanchez and Zheyuan Zhang. The effects of in-domain corpus size on pre-training bert. arXiv preprint arXiv:2212.07914, 2022

work page arXiv 2022
[35]

Enhancing language model reasoning with structured multi-level modeling

Siheng Xiong, Ali Payani, and Faramarz Fekri. Enhancing language model reasoning with structured multi-level modeling. InThe Fourteenth International Conference on Learning Representations, 2025

work page 2025
[36]

Scaling search-augmented llm reasoning via adaptive information control.arXiv preprint arXiv:2602.01672, 2026

Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C Kerce, and Faramarz Fekri. Scaling search-augmented llm reasoning via adaptive information control.arXiv preprint arXiv:2602.01672, 2026

work page arXiv 2026
[37]

Cheffusion: Multimodal foundation model integrating recipe and food image generation

Peiyu Li, Xiaobao Huang, Yijun Tian, and Nitesh V Chawla. Cheffusion: Multimodal foundation model integrating recipe and food image generation. InCIKM, 2024

work page 2024
[38]

Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks.arXiv, 2025

Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, and Nitesh V Chawla. Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks.arXiv, 2025

work page 2025
[39]

Crochetbench: Can vision-language models move from describing to doing in crochet domain?arXiv, 2025

Peiyu Li, Xiaobao Huang, Ting Hua, and Nitesh V Chawla. Crochetbench: Can vision-language models move from describing to doing in crochet domain?arXiv, 2025

work page 2025
[40]

Mapro: Recasting multi-agent prompt optimization as maximum a posteriori inference

Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, and Yanfang Ye. Mapro: Recasting multi-agent prompt optimization as maximum a posteriori inference. InFindings of the Association for Computational Linguistics: EACL 2026, pages 4458–4480, 2026

[41] Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Agentrouter: A knowledge-graph-guided llm router for collaborative multi-agent question answering. arXiv preprint arXiv:2510.05445, 2025.

[42] Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Ng-router: Graph-supervised multi-agent collaboration for nutrition question answering. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7508–7527, 2026.

[43] Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, and Chuxu Zhang. Evolverouter: Co-evolving routing and prompt for multi-agent question answering. arXiv preprint arXiv:2604.05149, 2026.

[44] Jiatan Huang, Zheyuan Zhang, Tianyi Ma, Mingchen Li, Yaning Zheng, Yanfang Ye, and Chuxu Zhang. Glen-bench: A graph-language based benchmark for nutritional health. arXiv preprint arXiv:2601.18106, 2026.

[45] Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, and Yanfang Ye. Drift-bench: Diagnosing cooperative breakdowns in llm agents under input faults via multi-turn interaction. arXiv preprint arXiv:2602.02455, 2026.

[46] Yuelin Hu, Zhengxue Cheng, Wei Liu, and Li Song. Entropy-gated selective policy optimization: Token-level gradient allocation for hybrid training of large language models. arXiv preprint arXiv:2602.03309, 2026.

[47] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.

[48] Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896, 2023. URL https://api.semanticscholar.org/CorpusID:257557820

[50] Caiqi Zhang, Fangyu Liu, Marco Basaldella, and Nigel Collier. Luq: Long-text uncertainty quantification for llms. ArXiv, abs/2403.20279, 2024. URL https://api.semanticscholar.org/CorpusID:268793903

[51] Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, and Tatsunori B. Hashimoto. Graph-based uncertainty metrics for long-form language model outputs. ArXiv, abs/2410.20783, 2024. URL https://api.semanticscholar.org/CorpusID:273654396

[53] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024.

[54] Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training. arXiv preprint arXiv:2504.19599, 2025.

[55] Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, et al. A survey of weight space learning: Understanding, representation, and generation. arXiv preprint arXiv:2603.10090, 2026.

[56] Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, and Hui Xiong. On the plasticity and stability for post-training large language models. arXiv preprint arXiv:2602.06453, 2026.

[57] Wenzheng Zhang and Karl Stratos. Rence: Learning to reason by noise contrastive estimation. arXiv preprint arXiv:2601.22432, 2026.

[58] Wulin Xie, Rui Dai, Ruidong Ding, Kaikui Liu, Xiangxiang Chu, Xinwen Hou, and Jie Wen. Q-hawkeye: Reliable visual policy optimization for image quality assessment. arXiv preprint arXiv:2601.22920, 2026.

[59] Yu Luo, Shuo Han, Yihan Hu, Dong Li, and Jianye Hao. Ratio-variance regularized policy optimization for efficient llm fine-tuning. arXiv preprint arXiv:2601.03320, 2026.

[60] Kangda Wei and Ruihong Huang. Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting. arXiv preprint arXiv:2601.09085, 2026.

[61] Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, and Dong Yu. Can llms guide their own exploration? gradient-guided reinforcement learning for llm reasoning. arXiv preprint arXiv:2512.15687, 2025.

[62] Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, and Sarath Chandar. Grpo-lambda: Credit assignment improves llm reasoning. arXiv preprint arXiv:2510.00194, 2025.

A Related Work

A.1 Uncertainty Estimation for Generation and Reasoning.

Large language models (LLMs) have advanced rapidly in recent years [1, 33–39]. Building on this progress...
