pith. sign in

arxiv: 2408.10692 · v2 · submitted 2024-08-20 · 💻 cs.CL

Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

Pith reviewed 2026-05-23 22:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords uncertainty quantificationlarge language modelsselective generationattention mapsregression modelautoregressive generationhallucination detectionunconditional uncertainty
0
0 comments X

The pith

A regression model on attention maps and recurrent scores learns unconditional uncertainty in LLMs for selective generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of obtaining reliable uncertainty scores for autoregressive LLMs because of the conditional dependencies between successive generation steps. It proposes training a regression model that takes as input the LLM's attention maps, the token probabilities at the current step, and uncertainty values computed recurrently from earlier tokens. A two-staged training procedure is introduced to incorporate the recurrent features. Experiments across ten datasets and three LLMs show that the resulting uncertainty estimates produce substantial gains in selective generation compared with existing unsupervised and supervised baselines.

Core claim

By training a regression model on LLM attention maps, current-step probabilities, and recurrently computed prior uncertainty scores, the conditional dependencies between autoregressive generation steps can be learned implicitly, yielding unconditional uncertainty estimates that support more accurate detection of low-quality or hallucinated outputs.

What carries the argument

Two-staged regression model that ingests attention maps, current probabilities, and recurrent uncertainty scores from previous tokens.

If this is right

  • Substantial gains in selective generation over rival unsupervised and supervised methods.
  • Consistent effectiveness across ten datasets and three different LLMs.
  • More reliable identification of hallucinations and low-quality outputs through improved uncertainty scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The recurrent training procedure could support online uncertainty updates during a single generation run without retraining.
  • Similar attention-plus-recurrent feature sets might transfer to uncertainty estimation in non-text autoregressive domains such as protein sequence models.
  • If the learned scores prove stable, they could serve as a lightweight reliability filter inside production LLM pipelines.

Load-bearing premise

Attention maps, current-step probabilities, and recurrent prior uncertainty scores together suffice to model the conditional dependency between autoregressive generation steps.

What would settle it

Applying the trained regressor to a held-out LLM or dataset and finding no improvement or a drop in selective-generation metrics relative to baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2408.10692 by Alexander Panchenko, Artem Shelmanov, Artem Vazhentsev, Ekaterina Fadeeva, Gleb Kuzmin, Ivan Lazichny, Maxim Panov, Preslav Nakov, Rui Xing, Timothy Baldwin.

Figure 1
Figure 1. Figure 1: An illustration of the proposed method TAD. The figure depicts generated tokens, uncertainty scores [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The fraction of cases where Gemma 7b pays [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Summary of 54 experimental setups with various models and datasets. Each cell in the diagram presents the fraction of experiments where a method from a row outperforms a method from a column. Warmer colors indicate better results. UQ Method Gemma 7b Llama-3 8b StableLM 12b Mean Rank MSP 8.61 7.17 6.83 4.50 Perplexity 7.78 8.44 8.33 5.33 Mean Token Entropy 8.94 9.11 9.00 9.00 Focus 13.00 9.50 10.50 13.67 Nu… view at source ↗
read the original abstract

Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM because it is hard to model it explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-staged training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes learning unconditional uncertainty scores for autoregressive LLMs by training a regression model on attention maps, current-step token probabilities, and recurrently computed prior uncertainty scores, using a two-staged training procedure to incorporate the recurrent features. It claims this implicitly captures the conditional dependencies between generation steps (which are hard to model explicitly) and demonstrates substantial gains in selective generation over unsupervised and supervised baselines across ten datasets and three LLMs.

Significance. If the central claim holds, the work would offer a practical supervised route to uncertainty quantification that sidesteps explicit modeling of autoregressive conditionals, with direct utility for selective generation and hallucination detection. The multi-dataset, multi-LLM evaluation is a strength, but the absence of any diagnostic confirming that the chosen features actually encode the claimed dependencies limits the result's interpretability.

major comments (2)
  1. [abstract and method description] The central claim (abstract and §3) that the regression learns the conditional dependency between generation steps rests on the unverified assumption that attention maps + current probabilities + recurrent prior scores suffice. No diagnostic is reported (e.g., ablation on sequence position sensitivity, comparison against an explicit chain-rule baseline, or analysis of whether predictions degrade when recurrent features are removed), so the reported gains cannot be confidently attributed to dependency modeling rather than incidental feature correlations.
  2. [§4] §4 (experimental setup) does not specify the exact regression architecture, loss, data splits, or hyper-parameter search procedure. Without these details it is impossible to assess whether the claimed improvements are robust or sensitive to post-hoc choices, undermining the soundness evaluation of the ten-dataset results.
minor comments (2)
  1. [§3] Notation for the recurrent uncertainty score is introduced without an explicit recurrence equation; adding it would clarify the two-staged training procedure.
  2. [tables in §5] Table captions should explicitly state the number of runs or seeds used for the reported means and standard deviations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and supporting analyses where appropriate.

read point-by-point responses
  1. Referee: [abstract and method description] The central claim (abstract and §3) that the regression learns the conditional dependency between generation steps rests on the unverified assumption that attention maps + current probabilities + recurrent prior scores suffice. No diagnostic is reported (e.g., ablation on sequence position sensitivity, comparison against an explicit chain-rule baseline, or analysis of whether predictions degrade when recurrent features are removed), so the reported gains cannot be confidently attributed to dependency modeling rather than incidental feature correlations.

    Authors: We agree that explicit diagnostics would strengthen attribution of the gains to dependency modeling. The two-staged training procedure is specifically designed to allow the regression model to incorporate recurrent prior uncertainty scores and thereby learn these dependencies implicitly from the provided features. The consistent improvements across ten datasets and three LLMs provide supporting evidence for the approach. To address the concern, we will add an ablation removing the recurrent features in the revised version and report the resulting performance drop. An explicit chain-rule baseline is difficult to construct given the paper's premise that such dependencies are hard to model directly, but we will discuss this limitation. revision: yes

  2. Referee: [§4] §4 (experimental setup) does not specify the exact regression architecture, loss, data splits, or hyper-parameter search procedure. Without these details it is impossible to assess whether the claimed improvements are robust or sensitive to post-hoc choices, undermining the soundness evaluation of the ten-dataset results.

    Authors: We acknowledge the omission and will expand §4 in the revision to include the precise regression architecture, loss function, data splitting protocol, and hyper-parameter search procedure. These additions will improve reproducibility and allow readers to evaluate the robustness of the reported results. revision: yes

Circularity Check

0 steps flagged

Empirical supervised regression on LLM features shows no circular derivation

full rationale

The paper describes an empirical supervised approach that extracts attention maps, current-step probabilities, and recurrently computed prior uncertainty scores as input features, then trains a regression model (with a two-staged procedure for recurrent features) to produce uncertainty scores. No equations, self-definitions, or self-citations are presented that reduce the output scores to the inputs by construction; the central claim rests on experimental gains in selective generation across ten datasets rather than any tautological renaming or fitted-input prediction. The derivation chain is therefore self-contained as standard feature-based supervised learning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The regression model itself is treated as a learned component whose parameters are fitted to data.

pith-pipeline@v0.9.0 · 5687 in / 1005 out tokens · 15826 ms · 2026-05-23T22:10:03.787539+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

    cs.CL 2025-02 unverdicted novelty 6.0

    Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.

  2. Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

    cs.LG 2024-12 unverdicted novelty 5.0

    Negative log-likelihood of the greedy-decoded most likely sequence (G-NLL) is a principled single-sequence uncertainty measure for LLMs that achieves state-of-the-art results.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Asma Ben Abacha and Dina Demner - Fushman. 2019. https://doi.org/10.1186/S12859-019-3119-4 A question-entailment approach to question answering . BMC Bioinform. , 20(1):511:1--511:23

  4. [4]

    Amos Azaria and Tom Mitchell. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.68 The internal state of an LLM knows when it ' s lying . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967--976, Singapore. Association for Computational Linguistics

  5. [5]

    Joris Baan, Nico Daheim, Evgenia Ilia, Dennis Ulmer, Haau-Sing Li, Raquel Fern \'a ndez, Barbara Plank, Rico Sennrich, Chrysoula Zerva, and Wilker Aziz. 2023. Uncertainty in natural language generation: From theory to applications. arXiv preprint arXiv:2307.15703

  6. [6]

    Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, et al. 2024. Stable lm 2 1.6 b technical report. arXiv preprint arXiv:2402.17834

  7. [7]

    Yuyan Chen, Qiang Fu, Yichen Yuan, Zhihao Wen, Ge Fan, Dayiheng Liu, Dongmei Zhang, Zhixu Li, and Yanghua Xiao. 2023. Hallucination detection: Robustly discerning reliable answers in large language models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 245--255

  8. [8]

    Julius Cheng and Andreas Vlachos. 2024. https://aclanthology.org/2024.eacl-long.129 Measuring uncertainty in neural machine translation with similarity-sensitive entropy . In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2115--2128, St. Julian ' s, Malta. Associat...

  9. [9]

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2023. https://arxiv.org/abs/2307.01379 Shifting attention to relevance: Towards the uncertainty estimation of large language models . Preprint, arXiv:2307.01379

  10. [10]

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. 2024. Fact-checking the output of large language models via token-level uncertainty quantification. arXiv preprint arXiv:2403.04696

  11. [11]

    Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, and Artem Shelmanov. 2023. https://doi.org/10.18653/v1/2023.emnlp-demo.41 LM -polygraph: Uncertainty estimation for language models . In Proceedings of the 2023 Confer...

  12. [12]

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Fr \'e d \'e ric Blain, Francisco Guzm \'a n, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. https://doi.org/10.1162/tacl_a_00330 Unsupervised quality estimation for neural machine translation . Transactions of the Association for Computational Linguistics, 8:539--555

  13. [13]

    Yarin Gal and Zoubin Ghahramani. 2016. https://proceedings.mlr.press/v48/gal16.html Dropout as a Bayesian approximation: Representing model uncertainty in deep learning . In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050--1059, New York, New York, USA. PMLR

  14. [14]

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2023. A survey of language model confidence estimation and calibration. arXiv preprint arXiv:2311.08298

  15. [15]

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. 2019. https://doi.org/10.18653/v1/D19-5409 SAMS um corpus: A human-annotated dialogue dataset for abstractive summarization . In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 70--79, Hong Kong, China. Association for Computational Linguistics

  16. [16]

    Jianfeng He, Linlin Yu, Shuo Lei, Chang-Tien Lu, and Feng Chen. 2024. https://doi.org/10.18653/v1/2024.findings-naacl.180 Uncertainty estimation on sequential labeling via uncertainty transmission . In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2823--2835, Mexico City, Mexico. Association for Computational Linguistics

  17. [17]

    Jianfeng He, Xuchao Zhang, Shuo Lei, Zhiqian Chen, Fanglan Chen, Abdulaziz Alhamadani, Bei Xiao, and ChangTien Lu. 2020. Towards more accurate uncertainty estimation in text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8362--8372

  18. [18]

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. https://doi.org/10.18653/v1/D19-1259 P ub M ed QA : A dataset for biomedical research question answering . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-I...

  19. [19]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Assoc...

  20. [20]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221

  21. [21]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. https://openreview.net/pdf?id=VD-AYtP0dve Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation . In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net

  22. [22]

    Salem Lahlou, Moksh Jain, Hadi Nekoei, Victor I Butoi, Paul Bertin, Jarrid Rector-Brooks, Maksym Korablyov, and Yoshua Bengio. 2022. DEUP : Direct epistemic uncertainty prediction. Transactions on Machine Learning Research

  23. [23]

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf Simple and scalable predictive uncertainty estimation using deep ensembles . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  24. [24]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. https://doi.org/10.18653/v1/2022.acl-long.229 T ruthful QA : Measuring how models mimic human falsehoods . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214--3252, Dublin, Ireland. Association for Computational Linguistics

  25. [25]

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2023. https://doi.org/10.48550/arXiv.2305.19187 Generating with confidence: Uncertainty quantification for black-box large language models . CoRR, abs/2305.19187

  26. [26]

    Yu Lu, Jiali Zeng, Jiajun Zhang, Shuangzhi Wu, and Mu Li. 2022. https://doi.org/10.18653/v1/2022.acl-long.167 Learning confidence for transformer-based neural machine translation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2353--2364, Dublin, Ireland. Association for Computati...

  27. [27]

    Andrey Malinin and Mark J. F. Gales. 2021. https://openreview.net/forum?id=jN5y-zb5Q7m Uncertainty estimation in autoregressive structured prediction . In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net

  28. [28]

    Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.557 S elf C heck GPT : Zero-resource black-box hallucination detection for generative large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004--9017, Singapore. Association for Computational...

  29. [29]

    Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi \` e re, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, L \' e onard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro - Ros, Ambrose Slone, Am \' e lie H \' e liou, Andrea Tacchetti, Anna Bulanova, Antonia Paterso...

  30. [30]

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.741 FA ct S core: Fine-grained atomic evaluation of factual precision in long form text generation . In Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

  31. [31]

    Cohen, and Mirella Lapata

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. https://doi.org/10.18653/v1/D18-1206 Don ' t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797--1807, Brussels, Belgium. Association for Co...

  32. [32]

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. 2024. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities. arXiv preprint arXiv:2405.20003

  33. [33]

    Yookoon Park and David Blei. 2024. Density uncertainty layers for reliable uncertainty estimation. In International Conference on Artificial Intelligence and Statistics, pages 163--171. PMLR

  34. [34]

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31

  35. [35]

    Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. https://doi.org/10.1162/tacl_a_00266 C o QA : A conversational question answering challenge . Transactions of the Association for Computational Linguistics, 7:249--266

  36. [36]

    Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. 2023. https://openreview.net/forum?id=kJUS5nD0vPB Out-of-distribution detection and selective generation for conditional language models . In The Eleventh International Conference on Learning Representations

  37. [37]

    Liu, and Christopher D

    Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. https://doi.org/10.18653/v1/P17-1099 Get to the point: Summarization with pointer-generator networks . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073--1083, Vancouver, Canada. Association for Computational Linguistics

  38. [38]

    Junya Takayama and Yuki Arase. 2019. https://doi.org/10.18653/v1/W19-4115 Relevant and informative response generation using pointwise mutual information . In Proceedings of the First Workshop on NLP for Conversational AI, pages 133--138. Association for Computational Linguistics

  39. [39]

    Liam van der Poel, Ryan Cotterell, and Clara Meister. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.399 Mutual information alleviates hallucinations in abstractive summarization . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5956--5965. Association for Computational Linguistics

  40. [40]

    Artem Vazhentsev, Gleb Kuzmin, Akim Tsvigun, Alexander Panchenko, Maxim Panov, Mikhail Burtsev, and Artem Shelmanov. 2023. https://doi.org/10.18653/v1/2023.acl-long.652 Hybrid uncertainty quantification for selective text classification in ambiguous tasks . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume ...

  41. [41]

    Yuxia Wang, Daniel Beck, Timothy Baldwin, and Karin Verspoor. 2022. https://doi.org/10.1162/tacl_a_00483 Uncertainty estimation and reduction of pre-trained models for text regression . Transactions of the Association for Computational Linguistics, 10:680--696

  42. [42]

    Liu, and Matt Gardner

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. https://doi.org/10.18653/v1/W17-4413 Crowdsourcing multiple choice science questions . In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 94--106, Copenhagen, Denmark. Association for Computational Linguistics

  43. [43]

    Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. https://doi.org/10.18653/v1/2021.acl-long.84 The art of abstention: Selective prediction and error regularization for natural language processing . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Languag...

  44. [44]

    Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. https://aclanthology.org/2023.acl-long.634 A lign S core: Evaluating factual consistency with a unified alignment function . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328--11348, Toronto, Canada. Association for Compu...

  45. [45]

    Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.58 Enhancing uncertainty-based hallucination detection with stronger focus . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 915--932, Singapor...

  46. [46]

    Xuchao Zhang, Fanglan Chen, Chang-Tien Lu, and Naren Ramakrishnan. 2019. https://doi.org/10.18653/v1/N19-1316 Mitigating uncertainty in document classification . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 3126--...