Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
arXiv preprint arXiv:2402.13213
7 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Task-Aware Calibration: Provably Optimal Decoding in LLMs
  Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
- Instance-Adaptive Online Multicalibration
  A single algorithm for online multicalibration achieves instance-adaptive rates by dynamically refining a dyadic prediction grid, recovering the worst-case Õ(T^{2/3}) bound and improving to Õ(√T) in marginal stochastic settings and Õ(√(JT)) for J-piecewise stationary means.
- PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
  PromptNCE frames LLM conditional probability estimation as contrastive prompting augmented with an OTHER category, recovering true P(y|x) and achieving up to 0.82 Spearman correlation with human-derived PMI on three datasets.
- Causal Evidence that Language Models use Confidence to Drive Behavior
  Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure
  Negative log-likelihood of the greedy-decoded most likely sequence (G-NLL) is a principled single-sequence uncertainty measure for LLMs that achieves state-of-the-art results.
- Enhancing Trust in Large Language Models via Uncertainty-Calibrated Fine-Tuning
  Uncertainty-aware fine-tuning with a decision-theory-based loss produces better-calibrated uncertainty estimates than standard training on free-form QA tasks.