Recognition: no theorem link
Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning
Pith reviewed 2026-05-15 02:15 UTC · model grok-4.3
The pith
Conformal aggregation gives finite-sample guarantees on confident-error rates for chain-of-thought reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that conformal risk control, applied to weighted score aggregates over chain-of-thought reasoning paths, controls the confident-error rate at any user-specified level with finite-sample validity. Score separability ensures that the resulting selective accuracy exceeds the baseline, with closed-form formulas predicting the gain from calibration data alone.
What carries the argument
Weighted score aggregation combined with conformal risk control for setting the abstention threshold on chain-of-thought reasoning paths.
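The calibration step described above can be sketched in a few lines. This is a minimal illustration of conformal risk control with a monotone 0/1 confident-error loss (answer given and wrong costs 1, abstention costs 0); the function name, the candidate threshold grid, and the score inputs are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def calibrate_abstention_threshold(cal_scores, cal_correct, alpha):
    """Pick the smallest score threshold tau such that answering only when
    the aggregate score >= tau keeps the confident-error rate <= alpha,
    using the standard conformal risk control finite-sample correction."""
    n = len(cal_scores)
    # Candidate thresholds: every observed calibration score.
    for tau in np.sort(np.unique(cal_scores)):
        answered = cal_scores >= tau
        errors = answered & ~cal_correct  # answered and wrong
        # CRC bound with max loss B = 1: (n * risk + B) / (n + 1) <= alpha
        if (errors.sum() + 1) / (n + 1) <= alpha:
            return tau
    return np.inf  # abstain on everything if no threshold meets the target
```

Under exchangeability of calibration and test problems, thresholding test-time aggregate scores at the returned `tau` gives the finite-sample guarantee on the confident-error rate.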
If this is right
- Confident-error rates remain consistent with user-specified targets across four benchmarks and multiple models.
- Selective accuracy improves to 90.1 percent on GSM8K by abstaining on fewer than 5 percent of problems.
- Closed-form expressions allow prediction of accuracy gains using only calibration data.
- The method applies at inference time to existing models without any retraining.
Where Pith is reading between the lines
- If score separability is common in practice, this approach could be wrapped around other aggregation-based LLM systems to add reliability guarantees.
- Practitioners could use the closed-form predictions to decide in advance how much calibration data to collect for a target accuracy level.
- Similar conformal techniques might address uncertainty in other multi-path reasoning or ensemble methods beyond CoT.
- The finite-sample nature makes it suitable for small calibration sets common in applied settings.
Load-bearing premise
The scores for correct reasoning paths must be distinguishable from those of incorrect paths so that the aggregation reliably ranks them higher.
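This premise is directly checkable on calibration data. A minimal sketch of two standard diagnostics, assuming only arrays of aggregate scores split by correctness (the function name is hypothetical): ROC-AUC of the score as a correctness predictor, computed via the Mann-Whitney statistic, and the Kolmogorov-Smirnov distance between the two score distributions.

```python
import numpy as np

def separability_diagnostics(scores_correct, scores_incorrect):
    """Quantify score separability between correct and incorrect paths:
    ROC-AUC = P(correct score > incorrect score) and the KS distance
    (max gap between the two empirical CDFs)."""
    a = np.asarray(scores_correct, dtype=float)
    b = np.asarray(scores_incorrect, dtype=float)
    # ROC-AUC via pairwise comparisons (ties count half).
    diff = a[:, None] - b[None, :]
    auc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
    # KS distance: largest gap between empirical CDFs on the pooled grid.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    ks = np.max(np.abs(cdf_a - cdf_b))
    return auc, ks
```

AUC near 0.5 (or KS near 0) would signal that the load-bearing premise fails and the closed-form accuracy predictions should not be trusted, even though the confident-error guarantee itself still holds.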
What would settle it
Observing a confident-error rate significantly above the calibrated target on new test data, beyond what calibration and test variability would explain, would falsify the guarantee.
Original abstract
Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation uncertainty the central challenge. This issue is critical where confidently incorrect answers are far more costly than abstentions. We introduce a conformal procedure for CoT reasoning that directly addresses aggregation uncertainty. Our approach replaces majority voting with weighted score aggregation over reasoning paths and calibrates an abstention rule using conformal risk control. This approach leads to finite-sample guarantees on the confident-error rate, the probability that the system answers and is wrong. We further identify score separability as the key condition under which abstention provably improves selective accuracy, and derive closed-form expressions that predict accuracy gains from calibration data alone. The method is fully inference-time and requires no retraining. Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets up to calibration-split and test-set variability. Our method achieves 90.1% selective accuracy on GSM8K by abstaining on less than 5% of problems, compared with 82% accuracy under a majority-voting baseline.
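The abstract's core substitution, weighted score aggregation in place of majority voting, can be sketched as follows. The per-path weights stand in for whatever score class the paper uses (verifier, confidence, or similar); treating them as a given input is an assumption of this sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency baseline: most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def weighted_aggregate(answers, weights):
    """Weighted score aggregation: sum per-path scores by candidate answer;
    return the best answer and its aggregate score, which an abstention
    rule can then threshold."""
    totals = {}
    for ans, w in zip(answers, weights):
        totals[ans] = totals.get(ans, 0.0) + w
    best = max(totals, key=totals.get)
    return best, totals[best]
```

Note how the two rules can disagree: three sampled paths ending in `["4", "4", "5"]` with weights `[0.2, 0.3, 0.9]` give `"4"` under majority vote but `"5"` under weighted aggregation, and the aggregate score `0.9` is exactly the quantity the conformal abstention threshold acts on.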
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a conformal procedure for chain-of-thought (CoT) reasoning that replaces majority voting with weighted score aggregation over multiple reasoning paths and uses conformal risk control to calibrate an abstention rule. This provides finite-sample guarantees on the confident-error rate (probability that the system answers and is wrong). The paper identifies score separability as the condition for provable accuracy gains and derives closed-form expressions predicting these gains from calibration data. Experiments across four benchmarks, four models, and three score classes show realized error rates consistent with targets, achieving 90.1% selective accuracy on GSM8K with <5% abstention.
Significance. The finite-sample guarantees are a notable strength, as they follow from standard conformal risk control under exchangeability and do not require retraining. If the closed-form predictions hold, this could enable reliable selective accuracy improvements in CoT systems. The empirical consistency with targets across settings supports practical utility, though the attribution to the separability mechanism requires further validation.
Major comments (1)
- [Abstract and theoretical derivation] The finite-sample guarantee on the confident-error rate follows directly from conformal risk control and is independent of score separability. However, the additional claim of closed-form expressions that predict accuracy gains from calibration data alone requires the assumption of score separability (correct-answer scores stochastically dominate incorrect ones). The manuscript reports no direct empirical check of this separability on the benchmarks (e.g., ROC-AUC of the weighted score as a correctness predictor or Kolmogorov-Smirnov statistic between score distributions), which is needed to substantiate that the observed 90.1% selective accuracy on GSM8K results from the predicted mechanism rather than incidental effects.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback on our manuscript. We address the major comment point-by-point below and have incorporated revisions where appropriate to strengthen the presentation of our results.
Point-by-point responses
Referee: The finite-sample guarantee on the confident-error rate follows directly from conformal risk control and is independent of score separability. However, the additional claim of closed-form expressions that predict accuracy gains from calibration data alone requires the assumption of score separability (correct-answer scores stochastically dominate incorrect ones). The manuscript reports no direct empirical check of this separability on the benchmarks (e.g., ROC-AUC of the weighted score as a correctness predictor or Kolmogorov-Smirnov statistic between score distributions), which is needed to substantiate that the observed 90.1% selective accuracy on GSM8K results from the predicted mechanism rather than incidental effects.
Authors: We thank the referee for highlighting this important distinction. The finite-sample guarantee on the confident-error rate is indeed a direct consequence of conformal risk control and holds under the standard exchangeability assumption without requiring score separability. The closed-form expressions for predicting accuracy gains, however, are explicitly derived under the score separability condition, as noted in the theoretical section of the manuscript. To substantiate the mechanism, we have added empirical validations of score separability in the revised version. Specifically, we report ROC-AUC scores for the weighted aggregation score as a binary classifier for correctness, as well as Kolmogorov-Smirnov statistics and p-values comparing the score distributions for correct versus incorrect answers across all benchmarks and models. These analyses confirm that separability holds in our experimental settings, with high ROC-AUC values (typically >0.85) and statistically significant differences in distributions, thereby supporting that the observed selective accuracy improvements, such as the 90.1% on GSM8K, arise from the predicted separability-based mechanism rather than incidental factors.
Revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The finite-sample guarantees on confident-error rate follow directly from standard conformal risk control under exchangeability of calibration and test problems, which is an external property independent of the paper's fitted values or separability assumption. The closed-form expressions for accuracy gains are mathematically derived from calibration data under the explicit score-separability condition; they do not reduce to the inputs by construction but instead apply a probabilistic model to those data. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the derivation. The method is fully specified at inference time with no retraining, so the core claims can be checked directly against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: exchangeability of calibration and test reasoning paths.
- Ad hoc to this paper: score separability between correct and incorrect paths.