Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
Pith reviewed 2026-05-23 02:49 UTC · model grok-4.3
The pith
Adapting Mahalanobis distance to multi-layer token embeddings produces more accurate uncertainty scores for large language models than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We adapt Mahalanobis Distance for text generation by extracting token embeddings from multiple layers of LLMs, computing MD scores for each token, and using linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, this approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data.
What carries the argument
Token-level Mahalanobis distance scores computed on embeddings from multiple LLM layers, then passed through a linear regression model to yield uncertainty estimates.
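The pipeline described above can be sketched roughly as follows. This is a minimal illustration on random toy data, not the authors' implementation: the function names, the ridge term added to the covariance, the synthetic labels, and the use of plain least squares are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(train_emb, ridge=1e-3):
    """Estimate mean and inverse covariance for one layer.

    train_emb has shape (n_tokens, dim). The ridge term is an assumption
    added here to keep the covariance invertible.
    """
    mu = train_emb.mean(axis=0)
    cov = np.cov(train_emb, rowvar=False) + ridge * np.eye(train_emb.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis(emb, mu, cov_inv):
    """Mahalanobis distance of each row of emb to the fitted Gaussian."""
    diff = emb - mu
    quad = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(quad, 0.0))

def md_features(layer_embs, params):
    """Stack per-token distances into shape (n_tokens, n_layers)."""
    cols = [mahalanobis(e, mu, ci) for e, (mu, ci) in zip(layer_embs, params)]
    return np.stack(cols, axis=1)

# Toy stand-in for token embeddings taken from three LLM layers.
rng = np.random.default_rng(0)
n_tokens, dim, n_layers = 200, 4, 3
layer_embs = [rng.normal(size=(n_tokens, dim)) for _ in range(n_layers)]
params = [fit_gaussian(e) for e in layer_embs]
features = md_features(layer_embs, params)

# Hypothetical supervision: 1.0 marks a token treated as unreliable.
mean_md = features.mean(axis=1)
labels = (mean_md > np.median(mean_md)).astype(float)

# Linear regression on the MD features (with a bias column).
X = np.hstack([features, np.ones((n_tokens, 1))])
weights, *_ = np.linalg.lstsq(X, labels, rcond=None)
uncertainty = X @ weights
```

In a real setting the embeddings would come from the hidden states of the chosen layers and the labels from a quality metric on generated text; both are replaced by synthetic values here.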
If this is right
- Uncertainty scores can be used directly to decide whether to generate a full response or to abstain in sequence-level tasks.
- The same scores support finer-grained decisions at the level of individual claims inside a longer answer.
- The method remains effective on data that differs from the training distribution, reducing the need for per-domain retraining.
- Computation stays cheaper than methods that require multiple forward passes or consistency checks across samples.
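As a rough illustration of the abstention use case in the first bullet, the sketch below keeps only the least uncertain responses and reports accuracy on what is kept. The function name `selective_accuracy`, the coverage parameter, and the toy scores are assumptions for this example, not quantities from the paper.

```python
def selective_accuracy(uncertainty, correct, coverage):
    """Accuracy on the `coverage` fraction of responses with lowest uncertainty.

    uncertainty: one score per response (higher means less trustworthy).
    correct: 1 if the response was judged correct, else 0.
    """
    n_keep = max(1, int(round(coverage * len(uncertainty))))
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep

# Toy values only: two confident correct answers, two uncertain wrong ones.
toy_uncertainty = [0.1, 0.9, 0.2, 0.8]
toy_correct = [1, 0, 1, 0]
acc_half = selective_accuracy(toy_uncertainty, toy_correct, coverage=0.5)
acc_full = selective_accuracy(toy_uncertainty, toy_correct, coverage=1.0)
```

A useful uncertainty score should make accuracy rise as coverage shrinks, which is what the toy values are constructed to show.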
Where Pith is reading between the lines
- If the scores truly reflect model-internal density rather than surface statistics, the same pipeline could be applied to other density-based measures beyond Mahalanobis distance.
- Token-level granularity opens the possibility of editing or masking only the uncertain spans inside an otherwise reliable response.
- Strong out-of-domain results suggest the regression may be learning properties of the model's embedding geometry that are stable across tasks.
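The span-level idea in the second bullet could look something like the sketch below, which groups consecutive tokens whose score exceeds a threshold. This is Pith's illustration of a possible extension, not something the paper reports; the function name and threshold are hypothetical.

```python
def uncertain_spans(token_scores, threshold):
    """Return (start, end) index pairs, end exclusive, for runs of tokens
    whose uncertainty score is strictly above the threshold."""
    spans, start = [], None
    for i, score in enumerate(token_scores):
        if score > threshold and start is None:
            start = i  # a new uncertain run begins
        elif score <= threshold and start is not None:
            spans.append((start, i))  # the run ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(token_scores)))
    return spans

# Toy per-token scores for a five-token response.
toy_spans = uncertain_spans([0.1, 0.9, 0.8, 0.2, 0.7], threshold=0.5)
```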
Load-bearing premise
The linear regression model trained on Mahalanobis distance features from in-domain token embeddings will continue to give accurate uncertainty scores on out-of-domain data without retraining.
What would settle it
Apply the trained regression to a new dataset drawn from a markedly different distribution and measure whether selective-generation accuracy or fact-checking F1 falls below the best baseline method.
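For the fact-checking side of this test, the comparison might be set up as in the sketch below. All numbers are made-up toy values rather than results from the paper, and the threshold, variable names, and labeling convention (1 = unsupported claim) are assumptions.

```python
def f1_score(gold, pred):
    """F1 for the positive class, where 1 marks an unsupported claim."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def flag_claims(scores, threshold):
    """Flag a claim as unsupported when its uncertainty exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

# Toy out-of-domain claims: gold labels plus scores from two methods.
gold = [1, 0, 1, 0, 0, 1]
method_scores = [0.9, 0.2, 0.7, 0.1, 0.6, 0.8]
baseline_scores = [0.6, 0.7, 0.4, 0.3, 0.8, 0.9]
threshold = 0.5

method_f1 = f1_score(gold, flag_claims(method_scores, threshold))
baseline_f1 = f1_score(gold, flag_claims(baseline_scores, threshold))
premise_holds = method_f1 >= baseline_f1
```

The premise would be in trouble if `premise_holds` came out false on a genuinely shifted dataset with the regressor left untouched.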
Original abstract
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.
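For orientation, the quantities named in the abstract can be written as below. The first expression is the standard Mahalanobis distance; the notation (layer index \ell, token position t, set of layers \mathcal{L}) and, in particular, the averaging over tokens in the second expression are assumptions, since the abstract does not say how per-token scores are aggregated.

```latex
\mathrm{MD}_{\ell}\!\left(h_t^{(\ell)}\right)
  = \sqrt{\left(h_t^{(\ell)} - \mu_{\ell}\right)^{\top}
          \Sigma_{\ell}^{-1}
          \left(h_t^{(\ell)} - \mu_{\ell}\right)},
\qquad
\hat{u} = w_0 + \sum_{\ell \in \mathcal{L}} w_{\ell} \cdot
          \frac{1}{T} \sum_{t=1}^{T} \mathrm{MD}_{\ell}\!\left(h_t^{(\ell)}\right)
```

Here \mu_{\ell} and \Sigma_{\ell} are the empirical mean and covariance of training-token embeddings at layer \ell, and the weights w are the fitted regression coefficients.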
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts Mahalanobis distance (MD) from classification to LLM text generation by extracting token embeddings from multiple layers, computing per-token MD scores, and training a linear regressor on these features to produce uncertainty scores. It claims this supervised density-based method substantially outperforms existing UQ approaches on eleven datasets for both sequence-level selective generation and claim-level fact-checking, while exhibiting strong out-of-domain generalization.
Significance. If the empirical claims hold under proper controls for baselines, metrics, and data overlap, the method could provide a more efficient alternative to consistency-based UQ for truthfulness elicitation in generative LLMs.
major comments (3)
- [Experiments] Experiments section: the central claim of substantial improvement on eleven datasets is not accompanied by any information on baseline implementations, the exact metrics used, statistical significance tests, or whether the regressor's training data overlaps with the evaluation sets, leaving the empirical support for the main result under-specified.
- [Method] Method section: MD is computed w.r.t. the empirical mean and covariance of in-domain token embeddings; the linear regressor is fitted only on in-domain MD-to-label pairs. The claim of strong OOD generalization therefore requires explicit evidence that distribution shifts in embedding covariances across the eleven datasets are mild enough to be linearly corrected without retraining.
- [Abstract] Abstract and §3: the method is a fitted empirical predictor rather than an algebraically determined score; this supervised character should be stated when contrasting with 'density-based' UQ, and the reported gains must be shown to exceed what a simple supervised baseline on the same features would achieve.
minor comments (2)
- [Method] Clarify the exact procedure for aggregating multi-layer MD scores into the feature vector for the regressor.
- [Experiments] Add a table or section listing the eleven datasets with their domains and train/eval splits to support the OOD generalization claim.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. We address each major comment below, agreeing where revisions are needed to improve clarity and rigor, and propose specific changes to the manuscript.
- Referee: [Experiments] Experiments section: the central claim of substantial improvement on eleven datasets is not accompanied by any information on baseline implementations, the exact metrics used, statistical significance tests, or whether the regressor's training data overlaps with the evaluation sets, leaving the empirical support for the main result under-specified.
  Authors: We agree that additional details are required to fully support the empirical claims. In the revised manuscript, we will expand the Experiments section to describe baseline implementations (including any hyperparameters and code references), specify the exact metrics (e.g., AUROC, AUPRC), report statistical significance tests (such as paired t-tests across runs), and clarify data splits to confirm no overlap between regressor training data and evaluation sets.
  Revision: yes
- Referee: [Method] Method section: MD is computed w.r.t. the empirical mean and covariance of in-domain token embeddings; the linear regressor is fitted only on in-domain MD-to-label pairs. The claim of strong OOD generalization therefore requires explicit evidence that distribution shifts in embedding covariances across the eleven datasets are mild enough to be linearly corrected without retraining.
  Authors: We acknowledge this point and the need for explicit supporting evidence. While the current cross-dataset results already indicate effective generalization, we will add a new subsection with cross-dataset training experiments (training the regressor on subsets of the eleven datasets and evaluating on held-out ones) to demonstrate that the linear correction handles embedding covariance shifts without retraining.
  Revision: partial
- Referee: [Abstract] Abstract and §3: the method is a fitted empirical predictor rather than an algebraically determined score; this supervised character should be stated when contrasting with 'density-based' UQ, and the reported gains must be shown to exceed what a simple supervised baseline on the same features would achieve.
  Authors: We agree that the supervised aspect merits clearer emphasis. Although the abstract already refers to a 'supervised UQ method,' we will revise §3 to explicitly contrast the fitted regressor with purely algebraic density-based scores and add experiments comparing against a simple supervised baseline (linear regression on raw token embeddings or layer-wise statistics) to confirm that the MD features drive the reported gains.
  Revision: yes
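The cross-dataset protocol promised in the second response (fit on some datasets, evaluate on a held-out one) could be organized along the lines of the sketch below. The dataset names, synthetic features, and the use of mean squared error as the evaluation metric are placeholders chosen for the example; the rebuttal itself mentions AUROC and AUPRC.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def leave_one_dataset_out(datasets):
    """For each dataset, train on all the others and score on the held-out one."""
    results = {}
    for held_out in datasets:
        X_tr = np.vstack([X for n, (X, _) in datasets.items() if n != held_out])
        y_tr = np.concatenate([y for n, (_, y) in datasets.items() if n != held_out])
        w = fit_linear(X_tr, y_tr)
        X_te, y_te = datasets[held_out]
        results[held_out] = float(np.mean((predict(w, X_te) - y_te) ** 2))
    return results

# Synthetic stand-ins for per-dataset MD features and quality labels.
rng = np.random.default_rng(0)
true_w = np.array([0.5, 0.2])
toy_datasets = {}
for name in ["toy_qa", "toy_summarization", "toy_translation"]:  # hypothetical names
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.01 * rng.normal(size=50)
    toy_datasets[name] = (X, y)

held_out_mse = leave_one_dataset_out(toy_datasets)
```

Because the held-out dataset never contributes to the fit, this layout also addresses the train/evaluation overlap concern raised in the first comment.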
Circularity Check
No circularity; standard supervised density-based UQ pipeline
The paper adapts Mahalanobis distance on token embeddings as input features and trains a linear regressor to produce uncertainty scores. This is a conventional supervised learning setup whose outputs are not algebraically equivalent to its inputs by definition, nor are any load-bearing steps reduced to self-citation chains or fitted parameters renamed as predictions. The derivation chain remains self-contained against external benchmarks (held-out datasets) and does not invoke uniqueness theorems or ansatzes from the authors' prior work.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear regression coefficients
axioms (1)
- Domain assumption: Token embeddings from selected LLM layers are approximately multivariate Gaussian, so that Mahalanobis distance is a valid density measure.
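The link between this assumption and density can be made explicit with the standard Gaussian log-density identity. The symbols h, \mu, and \Sigma (an embedding, and the mean and covariance it is compared against) are notation chosen here, not taken from the paper.

```latex
\log \mathcal{N}(h \mid \mu, \Sigma)
  = -\tfrac{1}{2}\,\mathrm{MD}(h)^{2} - \tfrac{1}{2}\log\det(2\pi\Sigma),
\qquad
\mathrm{MD}(h)^{2} = (h - \mu)^{\top} \Sigma^{-1} (h - \mu)
```

Under a Gaussian model, a larger Mahalanobis distance therefore means a strictly lower log-density; if the embeddings are far from Gaussian, that monotone relationship no longer holds.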