pith. machine review for the scientific record.

arxiv: 2302.09664 · v3 · submitted 2023-02-19 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 17:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords semantic entropy · uncertainty estimation · natural language generation · question answering · large language models · semantic equivalence · linguistic invariance · model calibration

The pith

Semantic entropy, which groups model outputs by shared meaning before measuring uncertainty, predicts answer accuracy more reliably than token-level entropy on question answering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a way to quantify uncertainty in large language models when they generate natural language outputs, such as answers to questions. Conventional entropy calculations treat every distinct sentence as unique, even when different wordings convey identical information, which distorts the uncertainty signal. To address this, the method first clusters generated sentences into semantic equivalence classes using the model itself without supervision, then computes entropy over the probabilities of these meaning-based clusters. Experiments across question answering datasets demonstrate that this semantic entropy correlates more strongly with whether the model's answer is correct than several baseline uncertainty measures. The result matters because accurate uncertainty estimates let users know when to trust or disregard a model's output in practical settings.

Core claim

The authors introduce semantic entropy as an entropy measure over semantic equivalence classes of generated sentences rather than over individual token sequences. Sentences are grouped into classes that share the same meaning through an unsupervised procedure that queries the language model itself; the entropy is then taken with respect to the total probability mass assigned to each class. This construction is invariant to linguistic rephrasings that preserve meaning and requires no model modifications, additional training data, or auxiliary models. Ablation studies on multiple question answering benchmarks show that semantic entropy is more predictive of model accuracy than comparable token-level baselines.
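To make the construction concrete, here is a minimal sketch of the entropy step, assuming sampled generations have already been assigned to equivalence classes and that per-sequence log-probabilities are available from the model. The function name is illustrative, and normalising over the sampled set is a Monte Carlo approximation, not necessarily the paper's exact estimator.

```python
import math
from collections import defaultdict

def semantic_entropy(log_probs, cluster_ids):
    """Entropy over semantic equivalence classes of sampled generations.

    log_probs   -- model log-probability of each sampled generation
    cluster_ids -- semantic-class label assigned to each generation
    """
    # Each class's probability is the total mass of its member sequences:
    # p(c) = sum over s in c of p(s).
    class_mass = defaultdict(float)
    for lp, cid in zip(log_probs, cluster_ids):
        class_mass[cid] += math.exp(lp)

    # Normalise over the sampled set (a Monte Carlo stand-in for the full
    # output distribution) and take the entropy of the class probabilities.
    total = sum(class_mass.values())
    return -sum((m / total) * math.log(m / total)
                for m in class_mass.values())

# Three paraphrases of one answer plus one distinct answer: the measure
# stays low despite four lexically different strings.
print(semantic_entropy([-1.0, -1.2, -1.1, -3.0],
                       ["paris", "paris", "paris", "lyon"]))
```

Because the three paraphrases land in one class, their probability mass is pooled before the entropy is taken, which is exactly the invariance to rephrasing that the claim rests on.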

What carries the argument

Semantic entropy: entropy computed over clusters of semantically equivalent generations identified unsupervised by the model itself.

Load-bearing premise

Semantic equivalence classes among generated sentences can be reliably identified in an unsupervised manner using the language model itself.
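A hedged sketch of how that premise could be operationalised, following the bidirectional-entailment idea the paper describes: two generations share a class only if each entails the other. The `entails` callable is a stand-in assumption (the generating model prompted for entailment judgments, or an off-the-shelf NLI model), not a specific API from the paper.

```python
def cluster_by_meaning(answers, entails):
    """Greedy clustering of generations by bidirectional entailment.

    answers -- list of generated answer strings
    entails -- hypothetical judge: callable (premise, hypothesis) -> bool,
               e.g. an NLI model or the LLM itself prompted for entailment
    Returns one cluster id per answer.
    """
    reps = []         # one representative string per cluster
    cluster_ids = []
    for ans in answers:
        for cid, rep in enumerate(reps):
            # Same meaning only if entailment holds in both directions.
            if entails(rep, ans) and entails(ans, rep):
                cluster_ids.append(cid)
                break
        else:
            reps.append(ans)  # no match: open a new cluster
            cluster_ids.append(len(reps) - 1)
    return cluster_ids

# Toy judge for illustration only: exact match after lowercasing.
print(cluster_by_meaning(["Paris", "paris", "Lyon"],
                         lambda a, b: a.lower() == b.lower()))
```

Note that the greedy pass makes cluster assignments depend on generation order and on the judge's consistency; that sensitivity is precisely where the load-bearing premise could fail.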

What would settle it

A dataset or experiment in which the unsupervised clustering places semantically distinct answers into the same class (or vice versa) and semantic entropy loses its advantage in predicting accuracy over baselines.
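One standard way such a test would be scored is AUROC, treating the uncertainty measure as a detector of incorrect answers; a minimal sketch using scikit-learn, with hypothetical variable names.

```python
from sklearn.metrics import roc_auc_score

def error_detection_auroc(uncertainties, is_correct):
    """How well an uncertainty score separates wrong from right answers.

    The positive class is "incorrect", so a useful score assigns higher
    uncertainty to errors; 0.5 means no signal, 1.0 perfect separation.
    """
    is_error = [0 if ok else 1 for ok in is_correct]
    return roc_auc_score(is_error, uncertainties)

# Example: per-question semantic entropy vs. whether the answer was right.
print(error_detection_auroc([0.1, 0.9, 0.2, 1.3],
                            [True, False, True, False]))
```

If the clustering failure described above occurred, this number for semantic entropy would fall back toward the baselines' scores, settling the question.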

read the original abstract

We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation models. We show that measuring uncertainty in natural language is challenging because of "semantic equivalence" -- different sentences can mean the same thing. To overcome these challenges we introduce semantic entropy -- an entropy which incorporates linguistic invariances created by shared meanings. Our method is unsupervised, uses only a single model, and requires no modifications to off-the-shelf language models. In comprehensive ablation studies we show that the semantic entropy is more predictive of model accuracy on question answering data sets than comparable baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces semantic entropy, an uncertainty measure for natural language generation that incorporates linguistic invariances arising from semantic equivalence among different phrasings. The approach is unsupervised, relies on a single off-the-shelf language model without modifications, and is evaluated via ablation studies claiming superior predictive power for model accuracy on question-answering datasets relative to standard baselines.

Significance. If the central empirical claim holds after addressing the clustering validation, the work would offer a practical advance in uncertainty estimation for NLG by handling semantic equivalence without external supervision or model changes. The unsupervised single-model design is a notable strength that could facilitate broader adoption in reliability-critical applications.

major comments (2)
  1. [Ablation studies] The ablation studies' claim of superior predictive performance for semantic entropy depends on the reliability of the unsupervised semantic equivalence clustering step, yet no details are supplied on the exact prompting/embedding procedure used to form clusters or on any independent validation of cluster quality (e.g., human agreement rates stratified by model confidence level).
  2. [Method] Because equivalence judgments are obtained from the same language model whose uncertainty is being quantified, the clustering step risks producing unreliable or inconsistent partitions precisely when the model is uncertain about the answer; this directly affects the entropy calculation and could inflate the reported advantage over baselines.
minor comments (1)
  1. [Abstract] The abstract states empirical superiority on QA datasets but omits any mention of the statistical tests employed or controls for confounding factors such as generation length or sampling temperature.
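On the length confound specifically: a standard control is length-normalised log-likelihood, which divides a sequence's score by its token count so that longer answers are not penalised merely for accumulating more terms. A minimal sketch of the distinction, as an illustration rather than the paper's exact baseline:

```python
def sequence_score(token_log_probs, length_normalise=False):
    """Sequence-level score from per-token log-probabilities.

    Unnormalised log-likelihood falls as generations get longer simply by
    summing more negative terms, so length itself can masquerade as
    uncertainty; dividing by length is the usual control.
    """
    total = sum(token_log_probs)
    return total / len(token_log_probs) if length_normalise else total

# Same average per-token confidence, very different raw totals.
print(sequence_score([-0.5] * 4), sequence_score([-0.5] * 12))
print(sequence_score([-0.5] * 4, True), sequence_score([-0.5] * 12, True))
```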

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify important aspects of our work on semantic entropy. We address each major point below and will revise the manuscript to improve transparency and robustness.

read point-by-point responses
  1. Referee: [Ablation studies] The ablation studies' claim of superior predictive performance for semantic entropy depends on the reliability of the unsupervised semantic equivalence clustering step, yet no details are supplied on the exact prompting/embedding procedure used to form clusters or on any independent validation of cluster quality (e.g., human agreement rates stratified by model confidence level).

    Authors: We agree that greater detail on the clustering procedure is needed for reproducibility. In the revised manuscript, we will expand the methods section to fully specify the prompting strategy for equivalence judgments and the embedding approach used to form clusters. We will also add a human evaluation of cluster quality, reporting agreement rates and stratifying results by model confidence levels to directly validate this component of the method. revision: yes

  2. Referee: [Method] Because equivalence judgments are obtained from the same language model whose uncertainty is being quantified, the clustering step risks producing unreliable or inconsistent partitions precisely when the model is uncertain about the answer; this directly affects the entropy calculation and could inflate the reported advantage over baselines.

    Authors: This is a substantive methodological concern. Using the same model for equivalence judgments introduces a potential dependency that could affect cluster reliability in low-confidence regimes. We will add a dedicated discussion section in the revision addressing this limitation, including analysis of how the entropy measure behaves under varying confidence levels and why the observed performance gains are not solely attributable to this effect. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in semantic entropy derivation

full rationale

The paper defines semantic entropy by extending standard entropy to group generations into semantic equivalence classes identified unsupervised via the same model. No equations or steps in the provided text reduce the final measure to a fitted parameter, self-referential definition, or load-bearing self-citation by construction. The method is explicitly described as model-agnostic and unsupervised without modifications, and ablation results are presented as empirical comparisons to baselines rather than forced outcomes. This satisfies the default expectation of a non-circular paper; the clustering step is a methodological choice whose quality is not shown to be tautological with the uncertainty output.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; the method is described at a high level without implementation specifics.

pith-pipeline@v0.9.0 · 5405 in / 959 out tokens · 42986 ms · 2026-05-12T17:55:04.481439+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  3. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  4. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

    cs.CL 2026-05 unverdicted novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

  5. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  6. Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...

  7. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  8. Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations

    cs.AI 2026-04 conditional novelty 7.0

    GROVE visualizes distributions of language model generations as overlapping paths through a text graph, with user studies showing that graph summaries aid structural judgments like diversity assessment while raw outpu...

  9. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  10. Bridging the Missing-Modality Gap: Improving Text-Only Calibration of Vision Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    A new Latent Imagination Module uses cross-attention to predict latent visual embeddings from text, improving accuracy and calibration of vision-language models on text-only inputs.

  11. LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

    stat.ML 2026-05 unverdicted novelty 6.0

    Response entropy in LLMs rises with missing context on SQuAD while sampling-based confidence stays high, supporting the multiple imputation criterion and introducing a diagnostic for uncertainty reduction by context level.

  12. Uncertainty Quantification for LLM-based Code Generation

    cs.SE 2026-05 unverdicted novelty 6.0

    RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.

  13. Active Testing of Large Language Models via Approximate Neyman Allocation

    cs.AI 2026-05 unverdicted novelty 6.0

    Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.

  14. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  15. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  16. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  17. Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.

  18. Geometry-Calibrated Conformal Abstention for Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Geometry-calibrated conformal abstention lets language models abstain from uncertain queries with finite-sample guarantees on both participation rate and conditional correctness of answers.

  19. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  20. The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.

  21. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

    cs.AI 2026-04 unverdicted novelty 6.0

    Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.

  22. CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

    cs.CL 2026-04 unverdicted novelty 6.0

    CLSGen is a dual-head LLM fine-tuning framework that enables joint probabilistic classification and verbalized explanation generation without catastrophic forgetting of generative capabilities.

  23. RAG Performance Prediction for Question Answering

    cs.CL 2026-04 unverdicted novelty 6.0

    A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.

  24. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  25. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  26. BALAR: A Bayesian Agentic Loop for Active Reasoning

    cs.AI 2026-05 unverdicted novelty 5.0

    BALAR is a task-agnostic Bayesian loop that maintains structured beliefs over latent states, selects questions via expected mutual information, and expands its state space when needed, delivering 14.6-38.5% accuracy g...

  27. Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    UE-DPO quantifies epistemic uncertainty from grounding failures to direct more learning pressure on hard visual tokens in preferred samples while easing penalties on dispreferred ones.

  28. LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

    cs.LG 2026-05 unverdicted novelty 5.0

    ACSE estimates LLM prompt uncertainty via adaptive clustering of semantic entropy across multiple responses and uses conformal prediction to bound error rates on accepted answers with distribution-free guarantees.

  29. U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

  30. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  31. Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

    cs.LG 2026-04 unverdicted novelty 4.0

    HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 31 Pith papers · 6 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  2. [2]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

  3. [3]

    Calibration of pre-trained transformers

    Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 295–302, 2020.

  4. [4]

    Unsupervised quality estimation for neural machine translation

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555, 2020.

  5. [5]

    Uncertainty-aware machine translation evaluation

    Taisiya Glushkova, Chrysoula Zerva, Ricardo Rei, and André FT Martins. Uncertainty-aware machine translation evaluation. arXiv preprint arXiv:2109.06352, 2021.

  6. [6]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

  7. [7]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

  8. [8]

    Abstract meaning representation for paraphrase detection

    Fuad Issa, Marco Damonte, Shay B Cohen, Xiaohui Yan, and Yi Chang. Abstract meaning representation for paraphrase detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 442–452, 2018.

  9. [9]

    Deup: Direct epistemic uncertainty prediction

    Moksh Jain, Salem Lahlou, Hadi Nekoei, Victor Butoi, Paul Bertin, Jarrid Rector-Brooks, Maksym Korablyov, and Yoshua Bengio. Deup: Direct epistemic uncertainty prediction. arXiv preprint arXiv:2102.08501, 2021.

  10. [10]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

  11. [11]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

  12. [12]

    Teaching models to express their uncertainty in words

    Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022.

  13. [13]

    Uncertainty estimation in autoregressive structured prediction

    Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650, 2020.

  14. [14]

    Regression prior networks

    Andrey Malinin, Sergey Chervontsev, Ivan Provilkov, and Mark Gales. Regression prior networks. arXiv preprint arXiv:2006.11590, 2020.

  15. [15]

    Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness

    Sabrina J Mielke, Arthur Szlam, Y-Lan Boureau, and Emily Dinan. Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness. arXiv preprint arXiv:2012.14983, 2020.

  16. [16]

    Correcting length bias in neural machine translation

    Kenton Murray and David Chiang. Correcting length bias in neural machine translation. arXiv preprint arXiv:1808.10006, 2018.

  17. [17]

    Charformer: Fast character transformers via gradient-based subword tokenization

    Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. arXiv preprint arXiv:2106.12672, 2021.

  18. [18]

    Entailment as few-shot learner

    Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. Entailment as few-shot learner. arXiv preprint arXiv:2104.14690, 2021.

  19. [19]

    Bilateral multi-perspective matching for natural language sentences

    Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. arXiv preprint arXiv:1702.03814, 2017.

  20. [20]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    Adina Williams, Nikita Nangia, and Samuel R Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426, 2017.

  21. [21]

    Deep learning for answer sentence selection

    Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.

  22. [22]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  23. [23]

    Work with foundation models implicitly focuses on lexical equivalence, which entails the others, but we usually care about semantic equivalence

    Table 3: Illustration of semantic, syntactic, and lexical equivalence. Work with foundation models implicitly focuses on lexical equivalence, which entails the others, but we usually care about semantic equivalence. Equivalence Sentence A Sentence B Lexical Syntactic Semantic Paris is the capital of F...

  24. [24]

    <g/>”,x, s(m))) ⊿ Does old sequence entail new one? right ← M(cat(x, s(m), “<g/>

    Lexically equivalent sequences use exactly the same symbols. They are always also semantically and syntactically equivalent (in a given context). Syntactically equivalent sentences have the same grammatical form. But they can have different meanings (not semantically equivalent) and can use different symbols (not lexically equivalent). Semantically equi...

  25. [25]

    Additionally, we investigate why the semantic entropy underperforms the length-normalised entropy at high temperatures

    As in the main body of the paper, we measure diversity as the average lexical overlap of the answers in the answer set. Additionally, we investigate why the semantic entropy underperforms the length-normalised entropy at high temperatures. To that end, we manually inspect and label 100 classifications of our semantic equivalence method at T=1.5, and we fin...

  26. [26]

    We find that on CoQA, we obtain accurate model results with zero-shot prompting

    We use the following prompts on CoQA and TriviaQA. We find that on CoQA, we obtain accurate model results with zero-shot prompting, while we have to use few-shot prompting to obtain accurate answers on closed-book TriviaQA. We use the following prompts for each of the settings: CoQA: [The provided context paragraph] [additional question-answer pairs] Q: [P...

  27. [27]

    brainstormed answers

    except for the exact matching accuracy criterion which is too demanding because of the much larger variety of possible answers for this task. Table 7: CoQA: the exact choice of the accuracy metric for the free-form open-book QA task has little effect on the assessment of the quality of the uncertainty measur...