
arxiv: 2605.19344 · v1 · pith:QBGEOQNQnew · submitted 2026-05-19 · 💻 cs.CL

Retrieval-Augmented Linguistic Calibration

Pith reviewed 2026-05-20 05:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords linguistic calibration · retrieval-augmented generation · faithfulness divergence · LLM uncertainty · confidence expression · question answering · model calibration · natural language rewriting

The pith

RALC rewrites LLM outputs with retrieval-augmented, calibrated linguistic expressions, improving in-domain faithfulness and calibration by up to 66% and 58%, respectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats linguistic confidence expressions as a distribution over possible audience interpretations rather than a single scalar value. This framework lets the authors define faithfulness as how well those expressions prepare an audience for the true outcome, measured by a new information-theoretic quantity called Faithfulness Divergence. They then introduce Retrieval-Augmented Linguistic Calibration, a post-hoc pipeline that retrieves relevant passages and rewrites model answers so the inserted phrases such as 'probably' or 'I believe' better reflect actual correctness rates. Experiments across three QA benchmarks and five LLM families show consistent gains over prior black-box and grey-box calibration techniques.
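
To make the distributional view concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes a hedging phrase is represented by a Beta distribution over perceived probability (the figure captions on this page mention Beta distributions for lexicon expressions) and scores that distribution against a revealed outcome using the expected Brier score and expected negative log-likelihood, two standard quantities the captions list alongside Faithfulness Divergence. The definition of Faithfulness Divergence itself is not given on this page, so it is not computed. The Beta(14, 6) parameters for "probably" and all function names are illustrative assumptions.

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift x upward so the asymptotic series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result


def expected_brier(a: float, b: float, correct: bool) -> float:
    """E[(p - y)^2] for p ~ Beta(a, b) and outcome y in {0, 1}."""
    mean = a / (a + b)
    second_moment = a * (a + 1) / ((a + b) * (a + b + 1))
    y = 1.0 if correct else 0.0
    return second_moment - 2 * y * mean + y * y


def expected_nll(a: float, b: float, correct: bool) -> float:
    """E[-log p] if the statement is correct, else E[-log(1 - p)]."""
    shape = a if correct else b
    return digamma(a + b) - digamma(shape)


# Hypothetical audience interpretation of "probably": Beta(14, 6), mean 0.7.
a, b = 14.0, 6.0
for outcome in (True, False):
    print(outcome, expected_brier(a, b, outcome), expected_nll(a, b, outcome))
```

Both scores are lower when the statement turns out to be correct, which matches the intuition that "probably" prepares an audience better for a true statement than for a false one.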

Core claim

By modeling linguistic confidence as a distribution over plausible perceived probabilities and using retrieval to guide rewriting, calibrated signals can be propagated back into natural language outputs. In-domain faithfulness and calibration improve by up to 66% and 58%, respectively, and RALC outperforms black-box and grey-box baselines while preserving the original response style.

What carries the argument

Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that retrieves relevant contexts and rewrites LLM responses to insert calibrated linguistic confidence expressions.
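
The retrieval step can be pictured with a small Python sketch. The Figure 8 caption on this page indicates that RALC uses KNN retrieval of hedging expressions with k = 5 performing well, and the Figure 7 caption pairs lexicon expressions with Beta distributions. Everything else below is an assumption made for illustration: the lexicon entries and their parameters are hypothetical, the distance is simply the gap between a phrase's Beta mean and the calibrated target confidence, and the rewriting step that inserts the retrieved phrases into the answer is omitted.

```python
# Hypothetical lexicon: phrase -> (alpha, beta) of its perceived-probability distribution.
LEXICON = {
    "almost certainly": (18.0, 2.0),   # mean 0.9
    "very likely": (16.0, 4.0),        # mean 0.8
    "probably": (14.0, 6.0),           # mean 0.7
    "possibly": (10.0, 10.0),          # mean 0.5
    "I am not sure, but": (8.0, 12.0), # mean 0.4
    "unlikely": (4.0, 16.0),           # mean 0.2
}


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def nearest_hedges(target: float, k: int = 5, lexicon=LEXICON) -> list:
    """Return the k phrases whose Beta mean is closest to the target confidence."""
    ranked = sorted(
        lexicon,
        key=lambda phrase: abs(beta_mean(*lexicon[phrase]) - target),
    )
    return ranked[:k]


# A calibrated confidence of 0.72 retrieves candidate phrases for the rewriter.
print(nearest_hedges(0.72, k=2))
```

A distance defined on the full distributions rather than on their means would be an equally plausible choice; the page does not say which one the authors use.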

If this is right

  • Linguistic expressions of uncertainty can be calibrated after generation without retraining or access to model internals.
  • Faithfulness Divergence supplies an evaluation axis that captures audience belief updating beyond standard expected calibration error.
  • The same retrieval-rewriting pipeline outperforms both black-box and grey-box baselines on in-domain QA tasks.
  • Improvements appear across multiple LLM families, indicating the approach does not depend on any single model architecture.
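
For reference, the standard expected calibration error mentioned in the list above can be sketched in Python as follows. This is the textbook equal-width-bin estimator, not the "generalised ECE" the paper reports, whose definition is not available on this page. The function name and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


# Toy data: four answers stated with 0.9 confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A scalar estimator like this needs one confidence value per answer, which is exactly the reduction the distributional framework is meant to avoid.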

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the rewriting step can be made robust to noisy retrieval sources, the method could extend to open-domain generation where external knowledge is imperfect.
  • Treating confidence as a distribution rather than a point estimate may help downstream systems that need to aggregate or compare uncertainty across multiple statements.
  • The faithfulness lens could be applied to other uncertainty cues such as hedging in reasoning traces or disclaimers in long-form answers.

Load-bearing premise

Retrieval-augmented rewriting can reliably insert calibrated confidence signals without introducing new factual errors or semantic distortions that offset the reported gains.

What would settle it

If side-by-side human or automated evaluation on the same questions shows that RALC-rewritten answers contain more factual inaccuracies or altered meanings than the original LLM outputs, the net benefit of the calibration gains would be overturned.

Figures

Figures reproduced from arXiv: 2605.19344 by Chang Xu, Jialin Yu, Linwei Tao, Minjing Dong, Philip Torr, Tao Huang, Yi-Fan Yeh.

Figure 1: Retrieval-Augmented Linguistic Calibration pipeline overview. In each calibration inference …
Figure 2: Pre-calibration → post-calibration changes in generalised ECE and Faithfulness Divergence across signal space (top) and linguistic space (bottom), averaged across MMLU, SQuAD 2.0, and TruthfulQA. RALC consistently improves (reduces) both metrics across all confidence signals in both spaces.
    … We elicit responses using the Direct QA and Hedged QA prompt templates based on the work of Yona et al. [8]. …
Figure 3: Calibration effectiveness and quality comparison between RALC (averaged across all …
Figure 4: Faithfulness Divergence, KL divergence, expected Brier score, and expected NLL under …
Figure 5: Faithfulness Divergence, KL divergence, expected Brier score, and expected NLL under …
Figure 6: LLM vs. human perceived linguistic confidence on the human-annotated benchmark of …
Figure 7: Sample hedging expressions from the lexicon, with their corresponding Beta distributions …
Figure 8: Impact of the choice of k for the KNN retrieval of hedging expressions in the RALC pipeline on Faithfulness Divergence and generalised ECE across different confidence signals for Llama-3.1-8B-Instruct on the TruthfulQA dataset. The results show that both metrics are not highly sensitive to the choice of k within a reasonable range, with k = 5 showing consistently better marginal performance in the exploration …
Figure 9: For each confidence signal for Direct QA responses, we plot the distribution of the standard …
Figure 10: We evaluate the quality of RALC by measuring the correlation between the calibrated …
Figure 11: Mean confidence vs. mean accuracy per (dataset, model) pair. All datasets are system …
Figure 12: Miscalibration bias difference $|\bar{b}_{\text{train}} - \bar{b}_{\text{test}}|$ vs. cross-domain advantage ($\mathrm{ECE}_{\text{in}} - \mathrm{ECE}_{\text{cross}}$). Colour indicates the test dataset. Transfer pairs with similar miscalibration biases achieve performance closer to in-domain calibration. This pattern follows from the learning dynamics of the calibration map. When a target domain has a weak bias, the in-domain calibrator has little signal to learn from and fits …
Figure 13: In-domain calibration reliability diagrams for MMLU across confidence signals and …
Figure 14: In-domain calibration reliability diagrams for TruthfulQA across confidence signals and …
Original abstract

Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a distributional framework for modeling linguistic confidence expressions in LLM outputs as distributions over perceived probabilities, capturing interpretive variability. It defines Faithfulness Divergence (FD) as an information-theoretic metric to quantify the surprise in audience beliefs upon truth revelation. Building on this, the paper proposes Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that employs retrieval-augmented rewriting to embed calibrated confidence signals into natural language statements. Experiments on three QA benchmarks across five LLM families report improvements in in-domain faithfulness and calibration of up to 66% and 58%, respectively, outperforming black-box and grey-box baselines.

Significance. If the central results hold after addressing the noted concerns, this work would contribute a principled approach to linguistic calibration that moves beyond scalar probabilities to handle co-occurring cues and audience interpretation. The introduction of FD provides a novel evaluation dimension, and RALC offers a practical, retrieval-based method for post-hoc improvement without retraining. The multi-benchmark, multi-model evaluation strengthens the empirical case for applicability in QA settings, potentially aiding more trustworthy human-AI interactions if semantic fidelity is confirmed.

major comments (2)
  1. [§4] §4 (RALC pipeline): The description of the retrieval-augmented rewriting step does not include quantitative controls or metrics for semantic and factual fidelity of the generated rewrites (e.g., no entailment scores, NLI checks, or human evaluations of meaning preservation). This is load-bearing for the central claim, as any introduced distortions or scope changes in the statements could artifactually inflate the reported Faithfulness Divergence reductions and calibration gains rather than reflecting genuine propagation of calibrated signals.
  2. [§5] §5 (Experiments): While headline improvements of up to 66% faithfulness and 58% calibration are reported, the section provides insufficient detail on baseline implementations (black-box and grey-box methods), including exact prompting strategies, hyperparameter settings, or statistical tests with error bars. This undermines assessment of whether the gains are robust and reproducible across the three QA benchmarks and five LLM families.
minor comments (2)
  1. [§3] Notation for the distributional confidence model in §3 could be clarified with an explicit example of how a sample statement maps to its probability distribution to aid reader comprehension.
  2. Figure captions in the experimental results section would benefit from more explicit descriptions of what each panel shows, particularly regarding in-domain vs. out-of-domain splits.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting the potential impact of our distributional framework and RALC pipeline. We address each major comment below and will incorporate revisions to enhance the manuscript's rigor and reproducibility.

Point-by-point responses
  1. Referee: [§4] §4 (RALC pipeline): The description of the retrieval-augmented rewriting step does not include quantitative controls or metrics for semantic and factual fidelity of the generated rewrites (e.g., no entailment scores, NLI checks, or human evaluations of meaning preservation). This is load-bearing for the central claim, as any introduced distortions or scope changes in the statements could artifactually inflate the reported Faithfulness Divergence reductions and calibration gains rather than reflecting genuine propagation of calibrated signals.

    Authors: We agree that explicit quantitative controls for semantic and factual fidelity are essential to substantiate that improvements arise from calibrated signals rather than meaning alterations. The RALC design retrieves from a curated corpus of verified statements to minimize drift, but the main text indeed omits NLI-based metrics or human evaluations of preservation. In the revised version, we will add a dedicated fidelity analysis subsection reporting entailment scores from a standard NLI model (e.g., average and distribution across rewrites), plus a small-scale human evaluation of meaning preservation on a sample of outputs. This will directly confirm that rewrites maintain core semantics while embedding calibrated expressions. revision: yes

  2. Referee: [§5] §5 (Experiments): While headline improvements of up to 66% faithfulness and 58% calibration are reported, the section provides insufficient detail on baseline implementations (black-box and grey-box methods), including exact prompting strategies, hyperparameter settings, or statistical tests with error bars. This undermines assessment of whether the gains are robust and reproducible across the three QA benchmarks and five LLM families.

    Authors: We concur that greater detail on baselines is needed for reproducibility and to demonstrate robustness. The original §5 and appendix describe the methods at a conceptual level with example prompts, but lack exhaustive templates, hyperparameter values, and statistical analysis. We will revise §5 to include: (i) verbatim prompting templates for all black-box and grey-box baselines, (ii) full hyperparameter specifications (e.g., generation temperature, retrieval k, model versions), and (iii) error bars from multiple seeds with paired statistical tests (e.g., t-tests or Wilcoxon) across the three benchmarks and five LLMs. These additions will enable independent verification of the reported gains. revision: yes
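
The error bars and paired tests promised in this response could look like the following minimal Python sketch. It computes the mean paired difference, its standard error, and the paired t statistic across seeds using only the standard library. The per-seed ECE values are hypothetical placeholders, not results from the paper, and the function name is an assumption. Turning the statistic into a p-value needs the t distribution with n - 1 degrees of freedom, for which a library routine such as `scipy.stats.ttest_rel` (or `scipy.stats.wilcoxon` for the non-parametric variant) would normally be used.

```python
import math
import statistics


def paired_summary(method_scores, baseline_scores):
    """Return (mean difference, standard error, paired t statistic)."""
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 runs")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    if stderr == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff, stderr, mean_diff / stderr


# Hypothetical ECE per seed (lower is better); placeholders, not reported numbers.
method_ece = [0.10, 0.12, 0.11]
baseline_ece = [0.20, 0.20, 0.20]
mean_diff, stderr, t_stat = paired_summary(method_ece, baseline_ece)
print(f"mean difference = {mean_diff:.3f} +/- {stderr:.3f}, "
      f"t = {t_stat:.2f}, df = {len(method_ece) - 1}")
```

Pairing by seed matters here: it removes seed-to-seed variance that both methods share, so a consistent gap is easier to detect than with an unpaired comparison.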

Circularity Check

0 steps flagged

No significant circularity; framework and empirical claims are self-contained

full rationale

The paper defines a distributional model of linguistic confidence, introduces Faithfulness Divergence as an information-theoretic metric, and describes RALC as a post-hoc retrieval-augmented rewriting pipeline. No equations, derivations, or fitted parameters are presented that reduce claimed improvements to inputs by construction. Reported gains (up to 66% faithfulness, 58% calibration) are positioned as empirical outcomes across benchmarks and LLM families rather than tautological predictions. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify core choices. The derivation chain therefore contains independent content and does not collapse to self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly relies on the existence of a suitable retrieval corpus and on the assumption that rewriting preserves factual content.

pith-pipeline@v0.9.0 · 5707 in / 1098 out tokens · 51258 ms · 2026-05-20T05:43:30.382647+00:00 · methodology



Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  [1] Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas W Mayer, and Padhraic Smyth. What large language models know and what people think they know. Nature Machine Intelligence, 7(2):221–231, 2025.

  [2] Sunnie SY Kim, Q Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan. "I’m not sure, but...": Examining the impact of large language models’ uncertainty expression on user reliance and trust. In Proceedings of the 2024 ACM conference on fairness, accountability, and transparency, pages 822–835, 2024.

  [3] Tom A Lamb, Desi R Ivanova, Philip Torr, and Tim GJ Rudner. Semantic-level confidence calibration of language models via temperature scaling. In ICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI, 2025.

  [4] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024.

  [5] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022.

  [6] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages ...

  [7] Hang Zhang and Laurence T Maloney. Ubiquitous log odds: a common representation of probability and frequency distortion in perception, action, and cognition. Frontiers in neuroscience, 6:1, 2012.

  [8] Gal Yona, Roee Aharoni, and Mor Geva. Can large language models faithfully express their intrinsic uncertainty in words? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7752–7764, 2024.

  [9] Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting uncertainty estimation and calibration of large language models. arXiv preprint arXiv:2505.23854, 2025.

  [10] Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A Lamb, Jialin Yu, Philip HS Torr, and Chang Xu. Can large language models express uncertainty like human? arXiv preprint arXiv:2509.24202, 2025.

  [11] Murat Sensoy, Lance Kaplan, and Melih Kandemir. Evidential deep learning to quantify classification uncertainty. Advances in neural information processing systems, 31, 2018.

  [12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017.

  [13] Peiqi Wang, Barbara D Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M Wells, Tina Kapur, and Polina Golland. Calibrating expressions of certainty. arXiv preprint arXiv:2410.04315, 2024.

  [14] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.

  [15] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.

  [16] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML, volume 1, page 2001, 2001.

  [17] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694–699, 2002.

  [18] Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial intelligence and statistics, pages 623–631. PMLR, 2017.

  [19] Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. PMLR, 2019.

  [20] Charlie Marx, Sofian Zalouk, and Stefano Ermon. Calibration by distribution matching: Trainable kernel calibration metrics. Advances in Neural Information Processing Systems, 36:25910–25928, 2023.

  [21] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.

  [22] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

  [23] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.

  [24] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214–3252, 2022.

  [25] Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063, 2024.

  [26] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

  [27] Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra. Calibrating long-form generations from large language models. In Findings of the association for computational linguistics: EMNLP 2024, pages 13441–13460, 2024.

  [28] Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, and Zhendong Mao. On the calibration of large language models and alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9778–9795, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/...

  [29] Daniel James Sumler, Lee Devlin, Simon Maskell, and Richard O Lane. An entropic metric for measuring calibration of machine learning models. arXiv preprint arXiv:2502.14545, 2025.

  [30] Andrew Thompson and Vivek Desai. Extending confidence calibration to generalised measures of variation. arXiv preprint arXiv:2602.12975, 2026.

  [31] Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, and Nicola Cancedda. Calibrating verbal uncertainty as a linear feature to reduce hallucinations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3769–3793, 2025.

  [32] Laurent Itti and Pierre Baldi. Bayesian surprise attracts human attention. Vision research, 49(10):1295–1306, 2009.

  [33] Satoshi Morita, Peter F Thall, and Peter Müller. Determining the effective sample size of a parametric prior. Biometrics, 64(2):595–602, 2008.

  [34] Anthropic. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026.

  [35] OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026.

  [36] Google DeepMind. Gemini 3 Flash: Best for frontier intelligence at speed. https://deepmind.google/models/gemini/flash/, 2025.

  [37] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

  [38] Meta AI. Llama 3.1 model card and prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/, 2024.

  [39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report.

  [40] Mistral AI Team. Mistral 7B. https://mistral.ai/news/announcing-mistral-7b, September 2023.

  [41] Google DeepMind. Gemma 4: Our most intelligent open models. https://deepmind.google/models/gemma/gemma-4/, 2025.

  [42] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

    A Theory. A.1 Beta distribution estimation. The Beta distribution is a natural choice for modelling random variables supported on [0,1], such as confidence scores and empirical accuracies. A Beta distribution with parameters (...

  [43] ... are provided in-context to anchor model ratings to human perception. Each evaluator scores every sentence 3 times (temperature = 1), yielding up to 20×3×3 = 180 raw scores per hedging expression. Non-verifiable template sentences: "There is a correlation between X and Y." "It rains tomorrow." "The experiment shows a significant effect." "The new policy im...
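
The appendix fragments above mention estimating a Beta distribution per hedging expression from up to 180 raw scores. The estimation procedure itself is cut off in this extract, so the following Python sketch uses the method of moments purely as an assumed stand-in. The function name and the three example scores are hypothetical.

```python
import statistics


def fit_beta_moments(scores):
    """Method-of-moments estimate of Beta(alpha, beta) from scores in (0, 1)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    var = statistics.variance(scores)  # sample variance
    if not 0.0 < mean < 1.0 or var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("moments are incompatible with a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0  # equals alpha + beta
    return mean * common, (1.0 - mean) * common


# Hypothetical perceived-confidence ratings for one hedging expression.
alpha, beta = fit_beta_moments([0.6, 0.7, 0.8])
print(alpha, beta)
```

The sum alpha + beta grows as raters agree more closely, so a phrase with a consistent interpretation gets a sharper distribution than one that raters read very differently.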