Refusal tokens: A simple way to calibrate refusals in large language models

· 2024 · arXiv 2412.06748

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

cs.SE · 2025-08-15 · unverdicted · novelty 6.0

ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.

citing papers explorer

Showing 2 of 2 citing papers.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence cs.CL · 2026-04-03 · unverdicted · none · ref 24
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal cs.SE · 2025-08-15 · unverdicted · none · ref 21
ORFuzz presents the first evolutionary testing framework for LLM over-refusal together with a new benchmark of 1,855 cases that triggers over-refusal at 63.56% average across ten models.

Refusal tokens: A simple way to calibrate refusals in large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer