Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Haining Yu; Jie Xu; Junhao Dong; Libo Qin; Qiguang Chen; Wenbo Pan; Xiaohua Jia; Xinfeng Li

arxiv: 2510.01782 · v2 · pith:VWO2OWZ6new · submitted 2025-10-02 · 💻 cs.CL · cs.AI

Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia

Pith reviewed 2026-05-21 21:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Refusal Index · knowledge-aware refusal · LLM factuality · factual tasks · model calibration · Spearman correlation · refusal behavior · error probability

The pith

The Refusal Index measures how well large language models refuse factual questions they do not know by correlating refusal rates with error rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Refusal Index to quantify knowledge-aware refusal, the ability of LLMs to decline questions outside their knowledge on factual tasks. It defines this index as Spearman's rank correlation between refusal probability and error probability, computed through a simple two-pass evaluation on existing datasets. A reader would care because LLMs frequently produce incorrect answers instead of refusing, which undermines trust in their factual outputs. The approach requires no extra ground-truth labels and yields stable results across models. Experiments on 16 models and 5 datasets show that refusal behavior forms a distinct, fragile property separate from overall accuracy.

Core claim

We define the Refusal Index as Spearman's rank correlation between a model's refusal probability and its error probability on factual questions. This index is obtained via a lightweight two-pass method that uses only observed refusal rates from standard evaluation runs. Across tested models and datasets, the index accurately captures knowledge-aware refusal, remains stable regardless of chosen refusal rates, and produces consistent model orderings that do not depend on overall accuracy levels. The results indicate that even high-accuracy models exhibit unreliable refusal on questions they do not know.
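As a rough illustration of the definition above, the following Python sketch computes Spearman's rank correlation between per-question refusal probabilities and per-question error probabilities. It assumes both probabilities are already available for every question, which is not necessarily how the paper's two-pass method estimates the index. The function names (`average_ranks`, `pearson`, `refusal_index`) are hypothetical and chosen only for this example.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def refusal_index(refusal_prob: Sequence[float], error_prob: Sequence[float]) -> float:
    """Spearman correlation, i.e. Pearson correlation of the rank vectors.

    Assumption: one refusal probability and one error probability per question.
    """
    if len(refusal_prob) != len(error_prob):
        raise ValueError("inputs must have one entry per question")
    return pearson(average_ranks(refusal_prob), average_ranks(error_prob))
```

Under this sketch, a model that refuses more often exactly on the questions it is more likely to get wrong scores close to 1, while a model whose refusals are unrelated to its errors scores close to 0. Computing Spearman as the Pearson correlation of average ranks keeps the result well defined when many questions share the same probability.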

What carries the argument

The Refusal Index, computed as Spearman's rank correlation between refusal probability and error probability on factual questions, which isolates the degree to which refusals align with actual knowledge gaps.

If this is right

High factual accuracy does not guarantee reliable refusal behavior on unknown questions.
Model rankings by knowledge-aware refusal stay consistent even when overall accuracy or refusal thresholds vary.
Refusal behavior can be assessed separately from accuracy using only standard evaluation data.
The index identifies models whose refusals better match their actual knowledge limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might optimize training to raise the index and reduce confident wrong answers.
The same correlation approach could be tested on reasoning or code tasks to check for similar knowledge-awareness gaps.
Tracking changes in the index after safety fine-tuning could show whether alignment improves or harms refusal calibration.

Load-bearing premise

Observed error rates from ordinary evaluation runs reliably estimate a model's true knowledge gaps without separate verification of each answer.

What would settle it

Apply an independent verification pass with external fact-checking to the same set of questions and test whether the correlation between refusals and verified unknowns disappears or reverses.

Original abstract

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, while existing metrics fail to capture this ability. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method which only require observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines a Refusal Index as Spearman's correlation between refusal rates and error rates from a two-pass setup, but the forced-answer pass may not cleanly proxy knowledge.

This paper defines a Refusal Index as the Spearman rank correlation between per-question refusal probability and per-question error probability on factual tasks. The authors measure it with a simple two-pass method: one run allows refusals, the other forces answers so that errors can be scored against ground truth. Experiments on 16 models and 5 datasets show that the index stays stable across refusal rates and produces model rankings that do not simply track overall accuracy or refusal frequency. That independence from raw performance numbers is the clearest new angle here. The scale of the tests lends some weight to the stability claim, and the two-pass approach is practical because it needs no ground-truth labels beyond those of a standard evaluation.

The soft spot is the assumption that error rates from the forced-answer run line up with the model's knowledge state in the refusal-allowed run. If prompt sensitivity or decoding changes shift the effective knowledge signal between passes, the paired values become misaligned and the correlation could reflect behavioral artifacts rather than knowledge-aware refusal. The abstract gives no details on exactly how error probability is computed, and it describes no consistency checks across the two runs. It mentions no statistical significance tests for the stability results either.

This paper is for researchers working on LLM factuality metrics and safety evaluations. Readers who want a lightweight way to score refusal behavior beyond accuracy will get value from the metric idea. It deserves a serious referee because the experiments are broad and the formulation is new, even though the two-pass validation needs tighter checks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Refusal Index (RI), defined as Spearman's rank correlation between per-question refusal probability and per-question error probability, to quantify knowledge-aware refusal in LLMs. It proposes a lightweight two-pass evaluation that derives the metric from observed refusal rates in standard runs (one allowing refusal, one forced-answer). Experiments across 16 models and 5 factual datasets claim that RI accurately measures the capability, remains stable across refusal rates, yields consistent model rankings independent of overall accuracy and refusal rates, and reveals that high factual accuracy does not imply reliable refusal behavior.

Significance. If the measurement assumptions hold, RI would fill a gap in factuality evaluation by isolating how well refusal decisions track actual knowledge gaps rather than generic difficulty or behavioral biases. The reported stability and independence from accuracy/refusal rates would indicate that the metric captures an intrinsic calibration property. The scale of the evaluation (16 models, 5 datasets) provides a solid empirical foundation for these observations and could guide future work on safer factual QA systems.

major comments (2)

[§3] §3 (RI Definition and two-pass method): The central claim that RI 'accurately quantifies' knowledge-aware refusal depends on error probability from the forced-answer pass serving as a valid proxy for the model's knowledge state in the refusal-allowed setting. The manuscript provides no verification or controls for prompt sensitivity, decoding differences, or calibration shifts between passes; without this, the per-question pairing may misalign and the correlation may reflect behavioral artifacts rather than knowledge awareness.
[§4] §4 (Experiments and stability claims): The abstract and results assert stability across refusal rates and consistent rankings independent of accuracy, yet no details are given on how per-question error probability is computed from observed outcomes, any post-hoc choices in aggregation, or statistical significance tests for the reported correlations and stability. This weakens support for the claim that RI isolates the intended capability.

minor comments (2)

[Abstract] Abstract: The phrase 'which only require observed refusal rates' contains a subject-verb agreement error and should read 'which only requires observed refusal rates'.
The manuscript would benefit from explicitly listing the exact prompts and decoding parameters used in each of the two evaluation passes to support reproducibility of the RI computation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of the Refusal Index (RI) definition and experimental reporting that we will address to strengthen the manuscript. We respond to each major comment below.

Referee: [§3] §3 (RI Definition and two-pass method): The central claim that RI 'accurately quantifies' knowledge-aware refusal depends on error probability from the forced-answer pass serving as a valid proxy for the model's knowledge state in the refusal-allowed setting. The manuscript provides no verification or controls for prompt sensitivity, decoding differences, or calibration shifts between passes; without this, the per-question pairing may misalign and the correlation may reflect behavioral artifacts rather than knowledge awareness.

Authors: We agree that the validity of the forced-answer pass as a proxy for knowledge state is a key assumption. The two-pass design uses identical base prompts and decoding parameters across passes, differing only in the explicit instruction to answer (versus allowing refusal), which is intended to surface the model's underlying knowledge without introducing new behavioral biases. The observed stability of RI rankings across 16 models and 5 datasets provides indirect empirical support that the metric is not dominated by pass-specific artifacts. In revision we will add an explicit discussion of this assumption in §3, including a small-scale sensitivity analysis on prompt phrasing for the forced-answer pass on one dataset to quantify any shifts.

revision: partial

Referee: [§4] §4 (Experiments and stability claims): The abstract and results assert stability across refusal rates and consistent rankings independent of accuracy, yet no details are given on how per-question error probability is computed from observed outcomes, any post-hoc choices in aggregation, or statistical significance tests for the reported correlations and stability. This weakens support for the claim that RI isolates the intended capability.

Authors: We will expand §4 to include the precise computation: per-question error probability is the fraction of incorrect answers obtained in the forced-answer pass (averaged over 3 independent generations per question to reduce sampling noise). No post-hoc filtering or aggregation choices beyond direct use of observed rates are applied. We will also report bootstrap-derived 95% confidence intervals on the Spearman correlations and on the stability of model rankings across refusal-rate strata to substantiate the independence claims.

revision: yes
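The computation described in this simulated response could look roughly like the Python sketch below: error probability as the fraction of incorrect forced-answer generations per question, followed by a bootstrap interval on the Spearman correlation. The percentile bootstrap over questions, the resample count, the seed, and all function names (`error_probability`, `ranks`, `spearman`, `bootstrap_ci`) are assumptions made for illustration; the source does not say which bootstrap variant would be used.

```python
import random
from typing import Sequence


def error_probability(correct_flags: Sequence[bool]) -> float:
    """Fraction of incorrect answers among one question's forced-answer generations."""
    return sum(not ok for ok in correct_flags) / len(correct_flags)


def ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


def bootstrap_ci(refusal, error, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling questions with replacement.

    Assumption: questions are treated as independent draws; alpha=0.05
    corresponds to the 95% level mentioned in the response.
    """
    rng = random.Random(seed)
    n = len(refusal)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(spearman([refusal[i] for i in idx], [error[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples where one variable is constant
    if not stats:
        raise ValueError("every resample was degenerate")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper
```

With only 3 generations per question, `error_probability` can take just four values (0, 1/3, 2/3, 1), so ties are common; the average-rank handling in `ranks` matters for that reason.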

Circularity Check

0 steps flagged

No significant circularity: RI is a direct empirical correlation metric computed from observed data

The paper defines RI explicitly as Spearman's rank correlation between per-question refusal probability and error probability, measured via a two-pass evaluation on standard runs (one observing refusals, one deriving errors against ground truth). This is a straightforward statistical computation from independent observations rather than a derivation that reduces to fitted parameters, self-referential definitions, or self-citation chains. No equations in the provided text show the metric being constructed from itself or renamed known results; claims of stability and consistent rankings are presented as empirical findings from experiments across models and datasets, not forced by the definition. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard statistical correlation and the assumption that two evaluation passes suffice to estimate both refusal and error probabilities. No free parameters are fitted to data. The main invented element is the RI metric itself, which lacks independent external validation beyond the reported experiments.

axioms (1)

[standard math] Spearman's rank correlation coefficient measures the monotonic relationship between two variables through their ranks.
Directly used to define RI from refusal and error probabilities.
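For reference, the standard textbook form of this coefficient is given below. The notation is an assumption of this note, not taken from the paper: $X$ stands for per-question refusal probability, $Y$ for per-question error probability, $R(\cdot)$ for the rank transform, and $n$ for the number of questions. The second expression holds only when there are no tied ranks.

```latex
\rho_s
  = \frac{\operatorname{cov}\bigl(R(X),\, R(Y)\bigr)}
         {\sigma_{R(X)}\,\sigma_{R(Y)}},
\qquad
\rho_s
  = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
  \quad \text{(no ties)},
\qquad
d_i = R(x_i) - R(y_i).
```

Because $\rho_s$ depends only on ranks, it is unchanged by any strictly increasing rescaling of either variable, which fits the report's description of RI as a rank-based rather than magnitude-based measure.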

invented entities (1)

Refusal Index (RI) [no independent evidence]
purpose: Quantify knowledge-aware refusal capability of LLMs
Newly defined metric whose validity is supported only by the paper's internal experiments.

pith-pipeline@v0.9.0 · 5742 in / 1359 out tokens · 39017 ms · 2026-05-21T21:55:00.606869+00:00 · methodology
