
arxiv: 2502.14427 · v2 · submitted 2025-02-20 · 💻 cs.CL

Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

Pith reviewed 2026-05-23 02:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords uncertainty quantification · large language models · Mahalanobis distance · token embeddings · selective generation · fact-checking · truthfulness

The pith

Adapting Mahalanobis distance to multi-layer token embeddings produces more accurate uncertainty scores for large language models than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a density estimation technique previously used in classification to the setting of text generation by large language models. It pulls embeddings for every token from several layers, computes Mahalanobis distances on those vectors, and feeds the resulting features into a linear regression trained to predict uncertainty. The resulting scores are evaluated on eleven datasets, both for deciding whether to output a full sequence and for checking individual claims. The approach beats existing information-based and consistency-based uncertainty methods while remaining fast to compute and generalizing to data outside the training distribution. This matters because better uncertainty estimates let systems know when to trust or withhold model answers, which directly affects reliability in applications that require truthful output.
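
The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are synthetic Gaussian draws, the names `fit_gaussian`, `mahalanobis_sq`, and `sequence_features` are hypothetical, the token-level scores are averaged into one feature per layer (this page does not specify the aggregation), and a small ridge term `eps` is added to the covariance for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM, N_TOKENS = 3, 8, 5  # hypothetical sizes, chosen only for the demo


def fit_gaussian(emb, eps=1e-3):
    """Estimate the centroid and regularized inverse covariance of an (n, d) array."""
    mu = emb.mean(axis=0)
    cov = np.cov(emb, rowvar=False) + eps * np.eye(emb.shape[1])
    return mu, np.linalg.inv(cov)


def mahalanobis_sq(emb, mu, cov_inv):
    """Squared Mahalanobis distance of every row of an (n, d) array to the centroid."""
    diff = emb - mu
    return np.einsum("nd,de,ne->n", diff, cov_inv, diff)


def sequence_features(layers, stats):
    """One feature per layer: the mean token-level score (aggregation is an assumption)."""
    return np.array([mahalanobis_sq(e, mu, ci).mean() for e, (mu, ci) in zip(layers, stats)])


# Reference statistics per layer, fitted on synthetic "training" token embeddings.
stats = [fit_gaussian(rng.normal(size=(500, DIM))) for _ in range(N_LAYERS)]


def fake_sequence(shift):
    """Synthetic stand-in for the per-layer token embeddings of one generation."""
    return [rng.normal(loc=shift, size=(N_TOKENS, DIM)) for _ in range(N_LAYERS)]


# Label 1 marks an "incorrect" generation, simulated here by shifting its embeddings.
labels = np.array([0] * 50 + [1] * 50)
X = np.stack([sequence_features(fake_sequence(2.0 * y), stats) for y in labels])

# Linear regression with an intercept, fitted by ordinary least squares.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, labels.astype(float), rcond=None)
scores = A @ w  # higher score = more uncertain
```

In this toy setting the shifted sequences sit far from the reference centroid, so their fitted scores come out higher on average. Real embeddings would come from the decoder layers of the model, and the regression target would be a quality or correctness label.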

Core claim

We adapt Mahalanobis Distance for text generation by extracting token embeddings from multiple layers of LLMs, computing MD scores for each token, and using linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, this approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data.

What carries the argument

Token-level Mahalanobis distance scores computed on embeddings from multiple LLM layers, then passed through a linear regression model to yield uncertainty estimates.
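
In symbols, the mechanism can be written roughly as follows. The notation is assumed for illustration and is not taken from the paper; the squared form shown here is common in the uncertainty literature, and the textbook distance is its square root.

```latex
% Assumed notation:
%   h_{\ell,t}            embedding of generated token t at layer \ell
%   \mu_\ell, \Sigma_\ell  centroid and covariance estimated from training embeddings at layer \ell
\[
  \mathrm{MD}_\ell(t) = (h_{\ell,t} - \mu_\ell)^\top \, \Sigma_\ell^{-1} \, (h_{\ell,t} - \mu_\ell)
\]
% f(x) is a feature vector built from the token-level scores of all layers for a generation x;
% w and b are the fitted regression coefficients.
\[
  U(x) = w^\top f(x) + b
\]
```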

If this is right

  • Uncertainty scores can be used directly to decide whether to generate a full response or to abstain in sequence-level tasks.
  • The same scores support finer-grained decisions at the level of individual claims inside a longer answer.
  • The method remains effective on data that differs from the training distribution, reducing the need for per-domain retraining.
  • Computation stays cheaper than methods that require multiple forward passes or consistency checks across samples.
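
The first two consequences amount to thresholding a score. The sketch below is a hypothetical illustration: the function names, the threshold of 0.5, and the fallback message are assumptions, and in practice the threshold would be tuned on held-out data.

```python
def selective_generate(answer, uncertainty, threshold=0.5, fallback="I am not sure."):
    """Return the answer only when its uncertainty score is at or below the threshold."""
    return answer if uncertainty <= threshold else fallback


def flag_claims(claims, uncertainties, threshold=0.5):
    """Pair each claim with True when its score exceeds the threshold (needs checking)."""
    return [(claim, u > threshold) for claim, u in zip(claims, uncertainties)]


# Sequence level: answer or abstain.
kept = selective_generate("Paris is the capital of France.", 0.12)
withheld = selective_generate("An answer the model is unsure about.", 0.87)

# Claim level: mark individual claims inside a longer answer.
flags = flag_claims(["claim A", "claim B"], [0.2, 0.9])
```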

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the scores truly reflect model-internal density rather than surface statistics, the same pipeline could be run with other density-based measures in place of Mahalanobis distance.
  • Token-level granularity opens the possibility of editing or masking only the uncertain spans inside an otherwise reliable response.
  • Strong out-of-domain results suggest the regression may be learning properties of the model's embedding geometry that are stable across tasks.

Load-bearing premise

The linear regression model trained on Mahalanobis distance features from in-domain token embeddings will continue to give accurate uncertainty scores on out-of-domain data without retraining.

What would settle it

Apply the trained regression to a new dataset drawn from a markedly different distribution and measure whether selective-generation accuracy or fact-checking F1 falls below the best baseline method.
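
One way to run such a comparison is to measure accuracy over the fraction of answers each scorer is most confident about. The sketch below is an assumption-laden illustration: `accuracy_at_coverage` is a hypothetical helper, the correctness labels and scores are made up, and the paper's own figures report PRR rather than this metric.

```python
import numpy as np


def accuracy_at_coverage(correct, uncertainty, coverage):
    """Accuracy over the `coverage` fraction of examples with the lowest uncertainty."""
    correct = np.asarray(correct, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    k = max(1, int(round(coverage * len(correct))))
    keep = np.argsort(uncertainty, kind="stable")[:k]
    return float(correct[keep].mean())


# Toy out-of-domain evaluation set: 1 = correct answer, 0 = incorrect answer.
correct = [1, 1, 0, 1, 0, 0]
method_u = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7]    # scores from the trained regression (made up)
baseline_u = [0.5, 0.9, 0.1, 0.4, 0.2, 0.8]  # scores from a baseline method (made up)

method_acc = accuracy_at_coverage(correct, method_u, coverage=0.5)
baseline_acc = accuracy_at_coverage(correct, baseline_u, coverage=0.5)
method_beats_baseline = method_acc >= baseline_acc
```

If the method's accuracy fell below the baseline's on a genuinely shifted dataset, that would count against the load-bearing premise.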

Figures

Figures reproduced from arXiv: 2502.14427 by Alexander Panchenko, Artem Shelmanov, Artem Vazhentsev, Ivan Lazichny, Lyudmila Rvanova, Maxim Panov, Timothy Baldwin.

Figure 1: An illustration of the proposed method. After each decoder layer, the embeddings of each generated …
Figure 2: Performance of embeddings from various layers in density-based scores. PRR …
Figure 3: Dependency of PRR↑ of the SATRMD+MSP and HUQ-SATRMD methods on the correctness threshold for the embedding selection for the centroid and covariance matrix for MD for the Llama 8b v3.1 model. Higher values indicate better results.
Figure 4: Dependency of PRR↑ of the SATRMD+MSP and HUQ-SATRMD methods on the number of the PCA components for the features of linear regression for the Llama 8b v3.1 model. Higher values indicate better results. The results with a threshold of 0.3 are significantly better than those with other thresholds. Specifically, lower thresholds (e.g., 0.1) result in selecting the embeddings corresponding to incorrect inst…
Figure 5 presents the results when varying the size of the training dataset for the supervised methods. We train the linear regression model on the training datasets of size: 100, 200, 500, 1000, 2000, and additionally on a training dataset of 5000 instances for SciQ and MMLU. Since the TruthfulQA dataset consists of only 817 instances, of which we use 409 instances as the test subset, we train linear regression on…
Original abstract

Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper adapts Mahalanobis distance (MD) from classification to LLM text generation by extracting token embeddings from multiple layers, computing per-token MD scores, and training a linear regressor on these features to produce uncertainty scores. It claims this supervised density-based method substantially outperforms existing UQ approaches on eleven datasets for both sequence-level selective generation and claim-level fact-checking, while exhibiting strong out-of-domain generalization.

Significance. If the empirical claims hold under proper controls for baselines, metrics, and data overlap, the method could provide a more efficient alternative to consistency-based UQ for truthfulness elicitation in generative LLMs.

major comments (3)
  1. [Experiments] Experiments section: the central claim of substantial improvement on eleven datasets provides no information on baseline implementations, exact metrics used, statistical significance tests, or whether regressor training data overlaps with evaluation sets, leaving the empirical support for the main result under-specified.
  2. [Method] Method section: MD is computed w.r.t. the empirical mean and covariance of in-domain token embeddings; the linear regressor is fitted only on in-domain MD-to-label pairs. The claim of strong OOD generalization therefore requires explicit evidence that distribution shifts in embedding covariances across the eleven datasets are mild enough to be linearly corrected without retraining.
  3. [Abstract] Abstract and §3: the method is a fitted empirical predictor rather than an algebraically determined score; this supervised character should be stated when contrasting with 'density-based' UQ, and the reported gains must be shown to exceed what a simple supervised baseline on the same features would achieve.
minor comments (2)
  1. [Method] Clarify the exact procedure for aggregating multi-layer MD scores into the feature vector for the regressor.
  2. [Experiments] Add a table or section listing the eleven datasets with their domains and train/eval splits to support the OOD generalization claim.
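
The covariance-shift concern in major comment 2 could be probed with a simple diagnostic. The sketch below is hypothetical and is not an analysis from the paper: `covariance_shift` is an illustrative helper, and synthetic Gaussian draws stand in for token embeddings from two datasets.

```python
import numpy as np


def covariance_shift(emb_a, emb_b):
    """Relative Frobenius distance between the covariance matrices of two (n, d) arrays."""
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    return float(np.linalg.norm(cov_a - cov_b) / np.linalg.norm(cov_a))


rng = np.random.default_rng(0)
in_domain = rng.normal(size=(2000, 4))           # stand-in for in-domain embeddings
same_dist = rng.normal(size=(2000, 4))           # a second sample from the same distribution
shifted = rng.normal(scale=3.0, size=(2000, 4))  # a sample with inflated variance

shift_small = covariance_shift(in_domain, same_dist)
shift_large = covariance_shift(in_domain, shifted)
```

A value near zero suggests the in-domain covariance is a reasonable reference for the other dataset; a large value signals the kind of shift the referee asks the authors to rule out.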

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address each major comment below, agreeing where revisions are needed to improve clarity and rigor, and propose specific changes to the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of substantial improvement on eleven datasets provides no information on baseline implementations, exact metrics used, statistical significance tests, or whether regressor training data overlaps with evaluation sets, leaving the empirical support for the main result under-specified.

    Authors: We agree that additional details are required to fully support the empirical claims. In the revised manuscript, we will expand the Experiments section to describe baseline implementations (including any hyperparameters and code references), specify the exact metrics (e.g., AUROC, AUPRC), report statistical significance tests (such as paired t-tests across runs), and clarify data splits to confirm no overlap between regressor training data and evaluation sets. revision: yes

  2. Referee: [Method] Method section: MD is computed w.r.t. the empirical mean and covariance of in-domain token embeddings; the linear regressor is fitted only on in-domain MD-to-label pairs. The claim of strong OOD generalization therefore requires explicit evidence that distribution shifts in embedding covariances across the eleven datasets are mild enough to be linearly corrected without retraining.

    Authors: We acknowledge this point and the need for explicit supporting evidence. While the current cross-dataset results already indicate effective generalization, we will add a new subsection with cross-dataset training experiments (training the regressor on subsets of the eleven datasets and evaluating on held-out ones) to demonstrate that the linear correction handles embedding covariance shifts without retraining. revision: partial

  3. Referee: [Abstract] Abstract and §3: the method is a fitted empirical predictor rather than an algebraically determined score; this supervised character should be stated when contrasting with 'density-based' UQ, and the reported gains must be shown to exceed what a simple supervised baseline on the same features would achieve.

    Authors: We agree that the supervised aspect merits clearer emphasis. Although the abstract already refers to a 'supervised UQ method,' we will revise §3 to explicitly contrast the fitted regressor with purely algebraic density-based scores and add experiments comparing against a simple supervised baseline (linear regression on raw token embeddings or layer-wise statistics) to confirm that the MD features drive the reported gains. revision: yes
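
The cross-dataset experiment promised in the second response could follow a leave-one-dataset-out loop. The sketch below is an assumption, not the authors' protocol: the dataset names and features are synthetic, the helper names are hypothetical, and mean squared error stands in for whatever metric the revised manuscript would report.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear regression with an intercept; returns the weight vector."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict(w, X):
    """Apply fitted weights (last entry is the intercept) to a feature matrix."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def leave_one_dataset_out(datasets):
    """Train on all datasets but one and report held-out mean squared error for each."""
    results = {}
    for held_out in datasets:
        X_tr = np.vstack([X for name, (X, _) in datasets.items() if name != held_out])
        y_tr = np.concatenate([y for name, (_, y) in datasets.items() if name != held_out])
        X_te, y_te = datasets[held_out]
        w = fit_linear(X_tr, y_tr)
        results[held_out] = float(np.mean((predict(w, X_te) - y_te) ** 2))
    return results


# Synthetic stand-ins for per-dataset (MD features, uncertainty targets).
rng = np.random.default_rng(0)
true_w = np.array([0.5, -0.2, 0.1])
datasets = {}
for name in ["dataset_a", "dataset_b", "dataset_c"]:
    X = rng.normal(size=(20, 3))
    y = X @ true_w + 0.01 * rng.normal(size=20)
    datasets[name] = (X, y)

held_out_mse = leave_one_dataset_out(datasets)
```

Here all three synthetic datasets share one feature-to-target relationship, so held-out error stays small. The referee's question is whether real datasets behave the same way.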

Circularity Check

0 steps flagged

No circularity; standard supervised density-based UQ pipeline

Full rationale

The paper adapts Mahalanobis distance on token embeddings as input features and trains a linear regressor to produce uncertainty scores. This is a conventional supervised learning setup whose outputs are not algebraically equivalent to its inputs by definition, nor are any load-bearing steps reduced to self-citation chains or fitted parameters renamed as predictions. The derivation chain remains self-contained against external benchmarks (held-out datasets) and does not invoke uniqueness theorems or ansatzes from the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions of Mahalanobis distance (multivariate normality of embeddings) and supervised regression (i.i.d. training data, linear relationship between features and target uncertainty). The linear regression coefficients are fitted parameters. No new entities are postulated.

free parameters (1)
  • linear regression coefficients
    Trained on MD features extracted from token embeddings to map to uncertainty targets.
axioms (1)
  • domain assumption: Token embeddings from selected LLM layers are approximately multivariate Gaussian so that Mahalanobis distance is a valid density measure.
    MD computation in the abstract presupposes this distributional form.
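
The link between the Gaussian assumption and the distance can be made explicit. For a multivariate Gaussian with mean \mu and covariance \Sigma (generic symbols, not the paper's notation), the negative log-density of an embedding h is the squared Mahalanobis distance up to a factor of one half and an additive constant:

```latex
\[
  -\log \mathcal{N}(h \mid \mu, \Sigma)
  = \tfrac{1}{2}\,(h - \mu)^\top \Sigma^{-1} (h - \mu)
  + \tfrac{1}{2}\,\log\det(2\pi\Sigma)
\]
```

Since the second term does not depend on h, ranking embeddings by Mahalanobis distance is equivalent to ranking them by Gaussian density. When the embeddings are far from Gaussian, that equivalence no longer holds and the distance is only a heuristic.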

pith-pipeline@v0.9.0 · 5737 in / 1258 out tokens · 27672 ms · 2026-05-23T02:49:06.556156+00:00 · methodology


Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 7 internal anchors

  1. [1]

    Asma Ben Abacha and Dina Demner - Fushman. 2019. https://doi.org/10.1186/S12859-019-3119-4 A question-entailment approach to question answering . BMC Bioinform. , 20(1):511:1--511:23

  2. [2]

    Amos Azaria and Tom Mitchell. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.68 The internal state of an LLM knows when it ' s lying . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967--976, Singapore. Association for Computational Linguistics

  3. [3]

    Joris Baan, Nico Daheim, Evgenia Ilia, Dennis Ulmer, Haau-Sing Li, Raquel Fern \'a ndez, Barbara Plank, Rico Sennrich, Chrysoula Zerva, and Wilker Aziz. 2023. https://arxiv.org/abs/2307.15703 Uncertainty in natural language generation: From theory to applications . arXiv preprint arXiv:2307.15703

  4. [4]

    Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. 2024. https://aclanthology.org/2024.findings-acl.260 Do androids know they ' re only dreaming of electric sheep? In Findings of the Association for Computational Linguistics ACL 2024, pages 4401--4420, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics

  5. [5]

    Jireh Chan, Steven Leow, Khean Bea, Wai Khuen Cheng, Seuk Wai Phoong, Zeng-Wei Hong, and Yen-Lin Chen. 2022. https://doi.org/10.3390/math10081283 Mitigating the multicollinearity problem and its machine learning approach: A review . Mathematics, 10:1283

  6. [6]

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024. https://openreview.net/forum?id=Zj12nzlQbz INSIDE: llms' internal states retain the power of hallucination detection . In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net

  7. [7]

    Yuyan Chen, Qiang Fu, Yichen Yuan, Zhihao Wen, Ge Fan, Dayiheng Liu, Dongmei Zhang, Zhixu Li, and Yanghua Xiao. 2023. https://arxiv.org/abs/2407.04121 Hallucination detection: Robustly discerning reliable answers in large language models . In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 245--255

  8. [8]

    Julius Cheng and Andreas Vlachos. 2024. https://aclanthology.org/2024.eacl-long.129 Measuring uncertainty in neural machine translation with similarity-sensitive entropy . In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2115--2128, St. Julian ' s, Malta. Associat...

  9. [9]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . arXiv preprint arXiv:2110.14168

  10. [10]

    Maxime Darrin, Pablo Piantanida, and Pierre Colombo. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.357 R ain P roof: An umbrella to shield text generator from out-of-distribution data . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5831--5857, Singapore. Association for Computational Linguistics

  11. [11]

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. https://doi.org/10.18653/v1/2024.acl-long.276 Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational...

  12. [12]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al - Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur \' e lien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi \` e...

  13. [13]

    Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022. https://doi.org/10.18653/v1/2022.naacl-main.387 On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, ...

  14. [14]

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, and Maxim Panov. 2024. https://doi.org/10.48550/arXiv.2403.04696 Fact-checking the output of large language models via token-level uncertainty quantification . In Findin...

  15. [15]

    Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, and Artem Shelmanov. 2023. https://doi.org/10.48550/arXiv.2311.07383 LM-Polygraph : Uncertainty estimation for language models . In Proceedings of the 2023 Conference ...

  16. [16]

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. https://www.nature.com/articles/s41586-024-07421-0 Detecting hallucinations in large language models using semantic entropy . Nature, 630(8017):625--630

  17. [17]

    Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov. 2024. https://doi.org/10.18653/v1/2024.acl-long.786 Don ' t hallucinate, abstain: Identifying LLM knowledge gaps via multi- LLM collaboration . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  18. [18]

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Fr \'e d \'e ric Blain, Francisco Guzm \'a n, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. https://doi.org/10.1162/tacl_a_00330 Unsupervised quality estimation for neural machine translation . Transactions of the Association for Computational Linguistics, 8:539--555

  19. [19]

    Yarin Gal and Zoubin Ghahramani. 2016. https://proceedings.mlr.press/v48/gal16.html Dropout as a Bayesian approximation: Representing model uncertainty in deep learning . In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050--1059, New York, New York, USA. PMLR

  20. [20]

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. https://doi.org/10.18653/v1/2024.naacl-long.366 A survey of confidence estimation and calibration in large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technol...

  21. [21]

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. https://doi.org/10.18653/v1/D19-5409 SAMS um corpus: A human-annotated dialogue dataset for abstractive summarization . In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70--79, Hong Kong, China. Association for Computational Linguistics

  22. [22]

    Jianfeng He, Linlin Yu, Shuo Lei, Chang-Tien Lu, and Feng Chen. 2024 a . https://doi.org/10.18653/v1/2024.findings-naacl.180 Uncertainty estimation on sequential labeling via uncertainty transmission . In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2823--2835, Mexico City, Mexico. Association for Computational Linguistics

  23. [23]

    Jianfeng He, Xuchao Zhang, Shuo Lei, Zhiqian Chen, Fanglan Chen, Abdulaziz Alhamadani, Bei Xiao, and Chang - Tien Lu. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.671 Towards more accurate uncertainty estimation in text classification . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, Nove...

  24. [24]

    Jinwen He, Yujia Gong, Zijin Lin, Cheng ' an Wei, Yue Zhao, and Kai Chen. 2024 b . https://doi.org/10.18653/v1/2024.findings-acl.608 LLM factoscope: Uncovering LLM s ' factual discernment through measuring inner states . In Findings of the Association for Computational Linguistics ACL 2024, pages 10218--10230, Bangkok, Thailand and virtual meeting. Associ...

  25. [25]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net

  26. [26]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L \' e lio Renard Lavaud, Marie - Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth \' e e Lacroix, and William El Sayed. 2023. https://doi.org/...

  27. [27]

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. https://doi.org/10.18653/v1/D19-1259 P ub M ed QA : A dataset for biomedical research question answering . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-I...

  28. [28]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...

  29. [29]

    Nikita Kotelevskii, Aleksandr Artemenkov, Kirill Fedyanin, Fedor Noskov, Alexander Fishkov, Artem Shelmanov, Artem Vazhentsev, Aleksandr Petiushko, and Maxim Panov. 2022. https://openreview.net/forum?id=v6NNlubbSQ Nonparametric uncertainty quantification for single deterministic neural network . In Advances in Neural Information Processing Systems

  30. [30]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. https://openreview.net/pdf?id=VD-AYtP0dve Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation . In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023

  31. [31]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  32. [32]

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html A simple unified framework for detecting out-of-distribution samples and adversarial attacks . In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Sy...

  33. [33]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

  34. [34]

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. https://openreview.net/pdf?id=DWkJCSxKU5 Generating with confidence: Uncertainty quantification for black-box large language models . Transactions on Machine Learning Research

  35. [35]

    Jeremiah Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax Weiss, and Balaji Lakshminarayanan. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/543e83748234f7cbab21aa0ade66565f-Paper.pdf Simple and principled uncertainty estimation with deterministic deep learning via distance awareness . In Advances in Neural Information Processing Sy...

  36. [36]

    Andrey Malinin and Mark J. F. Gales. 2021. https://openreview.net/forum?id=jN5y-zb5Q7m Uncertainty estimation in autoregressive structured prediction . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net

  37. [37]

    Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. https://doi.org/10.48550/arXiv.2303.08896 SelfCheckGPT : Zero-resource black-box hallucination detection for generative large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004--9017

  38. [38]

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.741 FA ct S core: Fine-grained atomic evaluation of factual precision in long form text generation . In Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

  39. [39]

    Cohen, and Mirella Lapata

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. https://doi.org/10.18653/v1/d18-1206 Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1...

  40. [40]

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. 2024. https://arxiv.org/abs/2405.20003 Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities . arXiv preprint arXiv:2405.20003

  41. [41]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

  42. [42]

    Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, and Irina Piontkovskaya. 2021. https://cdn.aaai.org/ojs/17612/17612-13-21106-1-2-20210518.pdf Revisiting mahalanobis distance for transformer-based out-of-domain detection . In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13675--13682

  43. [43]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. https://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer . J. Mach. Learn. Res., 21:140:1--140:67

  44. [44]

    Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. https://doi.org/10.1162/tacl_a_00266 C o QA : A conversational question answering challenge . Transactions of the Association for Computational Linguistics, 7:249--266

  45. [45]

    Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. 2023. https://openreview.net/forum?id=kJUS5nD0vPB Out-of-distribution detection and selective generation for conditional language models . In The Eleventh International Conference on Learning Representations

  46. [46]

    Morgane Rivi \` e re, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L \' e onard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram \' e , Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgi...

  47. [47]

    Peter J Rousseeuw. 1984. https://doi.org/10.1080/01621459.1984.10477105 Least median of squares regression . Journal of the American statistical association, 79(388):871--880

  48. [48]

    Liu, and Christopher D

    Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. https://doi.org/10.18653/v1/P17-1099 Get to the point: Summarization with pointer-generator networks . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073--1083, Vancouver, Canada. Association for Computational Linguistics

  49. [49]

    Artem Shelmanov, Evgenii Tsymbalov, Dmitri Puzyrev, Kirill Fedyanin, Alexander Panchenko, and Maxim Panov. 2021. https://www.aclweb.org/anthology/2021.eacl-main.157 How certain is your transformer? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1833--1840, Online. Associat...

  50. [50]

    Noora Shrestha. 2020. https://doi.org/10.12691/ajams-8-2-1 Detecting multicollinearity in regression analysis . American Journal of Applied Mathematics and Statistics, 8:39--42

  51. [51]

    Evgenii Tsymbalov, Maxim Panov, and Alexander Shapeev. 2018. https://link.springer.com/chapter/10.1007/978-3-030-11027-7_24 Dropout-based active learning for regression . In Analysis of Images, Social Networks and Texts: 7th International Conference, AIST 2018, Moscow, Russia, July 5--7, 2018, pages 247--258

  52. [52]

    Joost van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. 2020. http://proceedings.mlr.press/v119/van-amersfoort20a.html Uncertainty estimation using a single deep deterministic neural network . In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , volume 119 of Proceedings of Machine Le...

  53. [53]

    Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Sergey Petrakov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, and Artem Shelmanov. 2024. https://arxiv.org/abs/2406.15627 Benchmarking uncertainty quantification methods for lar...

  54. [54]

    Roman Vashurin, Maiya Goloburda, Preslav Nakov, Artem Shelmanov, and Maxim Panov. 2025. https://arxiv.org/abs/2502.04964 CoCoA: A generalized approach to uncertainty quantification by integrating confidence and consistency of LLM outputs. arXiv preprint arXiv:2502.04964

  55. [55]

    Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, and Artem Shelmanov. 2024. https://arxiv.org/abs/2408.10692 Unconditional truthfulness: Learning conditional dependency for uncertainty quantification of large language models. arXiv preprint arXiv:2408.10692

  56. [56]

    Artem Vazhentsev, Gleb Kuzmin, Artem Shelmanov, Akim Tsvigun, Evgenii Tsymbalov, Kirill Fedyanin, Maxim Panov, Alexander Panchenko, Gleb Gusev, Mikhail Burtsev, Manvel Avetisian, and Leonid Zhukov. 2022. https://doi.org/10.18653/v1/2022.acl-long.566 Uncertainty estimation of transformer predictions for misclassification detection. In Proceedings of the 6...

  57. [57]

    Artem Vazhentsev, Gleb Kuzmin, Akim Tsvigun, Alexander Panchenko, Maxim Panov, Mikhail Burtsev, and Artem Shelmanov. 2023a. https://aclanthology.org/2023.acl-long.652 Hybrid uncertainty quantification for selective text classification in ambiguous tasks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume ...

  58. [58]

    Artem Vazhentsev, Akim Tsvigun, Roman Vashurin, Sergey Petrakov, Daniil Vasilev, Maxim Panov, Alexander Panchenko, and Artem Shelmanov. 2023b. https://aclanthology.org/2023.findings-acl.93 Efficient out-of-domain detection for sequence to sequence models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1430--1454, Toronto,...

  59. [59]

    Yuxia Wang, Daniel Beck, Timothy Baldwin, and Karin Verspoor. 2022. https://doi.org/10.1162/tacl_a_00483 Uncertainty estimation and reduction of pre-trained models for text regression. Transactions of the Association for Computational Linguistics, 10:680--696

  60. [60]

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. https://doi.org/10.18653/v1/W17-4413 Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark. Association for Computational Linguistics

  61. [61]

    Yijun Xiao and William Yang Wang. 2021. https://doi.org/10.18653/v1/2021.eacl-main.236 On hallucination and predictive uncertainty in conditional language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2734--2744, Online. Association for Computational Linguistics

  62. [62]

    Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. https://doi.org/10.18653/v1/2021.acl-long.84 The art of abstention: Selective prediction and error regularization for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Languag...

  63. [63]

    KiYoon Yoo, Jangho Kim, Jiho Jang, and Nojun Kwak. 2022. https://doi.org/10.18653/v1/2022.findings-acl.289 Detection of adversarial examples in text classification: Benchmark and baseline via robust density estimation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3656--3672, Dublin, Ireland. Association for Computational ...

  64. [64]

    Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. https://doi.org/10.18653/v1/2023.acl-long.634 AlignScore: Evaluating factual consistency with a unified alignment function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328--11348, Toronto, Canada. Association for Co...

  65. [65]

    Xuchao Zhang, Fanglan Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2019. https://doi.org/10.18653/v1/N19-1316 Mitigating uncertainty in document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3126--...
