arxiv: 2510.01782 · v2 · pith:VWO2OWZ6new · submitted 2025-10-02 · 💻 cs.CL · cs.AI

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Pith reviewed 2026-05-21 21:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Refusal Index · knowledge-aware refusal · LLM factuality · factual tasks · model calibration · Spearman correlation · refusal behavior · error probability

The pith

The Refusal Index measures how well large language models refuse factual questions they do not know by correlating refusal rates with error rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Refusal Index to quantify knowledge-aware refusal, the ability of LLMs to decline questions outside their knowledge on factual tasks. It defines this index as Spearman's rank correlation between refusal probability and error probability, computed through a simple two-pass evaluation on existing datasets. A reader would care because LLMs frequently produce incorrect answers instead of refusing, which undermines trust in their factual outputs. The approach requires no extra ground-truth labels and yields stable results across models. Experiments on 16 models and 5 datasets show that refusal behavior forms a distinct, fragile property separate from overall accuracy.

Core claim

We define the Refusal Index as Spearman's rank correlation between a model's refusal probability and its error probability on factual questions. This index is obtained via a lightweight two-pass method that uses only observed refusal rates from standard evaluation runs. Across tested models and datasets, the index accurately captures knowledge-aware refusal, remains stable regardless of chosen refusal rates, and produces consistent model orderings that do not depend on overall accuracy levels. The results indicate that even high-accuracy models exhibit unreliable refusal on questions they do not know.
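
The definition above can be illustrated with a minimal sketch. It assumes per-question estimates of refusal probability and error probability are already available, and it computes Spearman's rank correlation as the Pearson correlation of average ranks. The function names and the sample values are hypothetical, and this is not the paper's two-pass estimator, which is described as needing only observed refusal rates.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def refusal_index(refusal_prob, error_prob):
    """Spearman's rank correlation between per-question refusal and error probabilities."""
    if len(refusal_prob) != len(error_prob):
        raise ValueError("inputs must have one entry per question")
    return pearson(average_ranks(refusal_prob), average_ranks(error_prob))


if __name__ == "__main__":
    # Hypothetical per-question estimates for five questions (illustrative only).
    refusal = [0.9, 0.1, 0.6, 0.0, 0.3]
    error = [1.0, 0.0, 0.67, 0.33, 0.33]
    print(f"RI = {refusal_index(refusal, error):.3f}")
```

A value near 1 would mean the model refuses most often on the questions it is most likely to get wrong; a value near 0 would mean refusals are unrelated to error likelihood.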

What carries the argument

The Refusal Index, computed as Spearman's rank correlation between refusal probability and error probability on factual questions, isolates the degree to which refusals align with actual knowledge gaps.
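
For reference, the standard form of Spearman's rank correlation is written out below. The symbols are assumptions introduced here, not the paper's notation: r_i and e_i denote the refusal and error probabilities for question i out of n, and rk(·) denotes the rank of a value within its sample, with ties given average ranks.

```latex
\begin{align*}
\mathrm{RI} &= \rho_s(r, e)
  = \frac{\operatorname{cov}\bigl(\operatorname{rk}(r),\, \operatorname{rk}(e)\bigr)}
         {\sigma_{\operatorname{rk}(r)}\, \sigma_{\operatorname{rk}(e)}} \\
% Shortcut form, valid only when there are no tied values:
\rho_s &= 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)},
  \qquad d_i = \operatorname{rk}(r_i) - \operatorname{rk}(e_i)
\end{align*}
```

Because only ranks enter the formula, the value is unchanged under any strictly increasing rescaling of either variable, which fits the summary's claim that the index is insensitive to the overall refusal rate.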

If this is right

  • High factual accuracy does not guarantee reliable refusal behavior on unknown questions.
  • Model rankings by knowledge-aware refusal stay consistent even when overall accuracy or refusal thresholds vary.
  • Refusal behavior can be assessed separately from accuracy using only standard evaluation data.
  • The index identifies models whose refusals better match their actual knowledge limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might optimize training to raise the index and reduce confident wrong answers.
  • The same correlation approach could be tested on reasoning or code tasks to check for similar knowledge-awareness gaps.
  • Tracking changes in the index after safety fine-tuning could show whether alignment improves or harms refusal calibration.

Load-bearing premise

Observed error rates from ordinary evaluation runs reliably estimate a model's true knowledge gaps without separate verification of each answer.

What would settle it

Apply an independent verification pass with external fact-checking to the same set of questions and test whether the correlation between refusals and verified unknowns disappears or reverses.

Original abstract

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, while existing metrics fail to capture this ability. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method which only require observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Refusal Index (RI), defined as Spearman's rank correlation between per-question refusal probability and per-question error probability, to quantify knowledge-aware refusal in LLMs. It proposes a lightweight two-pass evaluation that derives the metric from observed refusal rates in standard runs (one allowing refusal, one forced-answer). Experiments across 16 models and 5 factual datasets claim that RI accurately measures the capability, remains stable across refusal rates, yields consistent model rankings independent of overall accuracy and refusal rates, and reveals that high factual accuracy does not imply reliable refusal behavior.

Significance. If the measurement assumptions hold, RI would fill a gap in factuality evaluation by isolating how well refusal decisions track actual knowledge gaps rather than generic difficulty or behavioral biases. The reported stability and independence from accuracy/refusal rates would indicate that the metric captures an intrinsic calibration property. The scale of the evaluation (16 models, 5 datasets) provides a solid empirical foundation for these observations and could guide future work on safer factual QA systems.

major comments (2)
  1. [§3] §3 (RI Definition and two-pass method): The central claim that RI 'accurately quantifies' knowledge-aware refusal depends on error probability from the forced-answer pass serving as a valid proxy for the model's knowledge state in the refusal-allowed setting. The manuscript provides no verification or controls for prompt sensitivity, decoding differences, or calibration shifts between passes; without this, the per-question pairing may misalign and the correlation may reflect behavioral artifacts rather than knowledge awareness.
  2. [§4] §4 (Experiments and stability claims): The abstract and results assert stability across refusal rates and consistent rankings independent of accuracy, yet no details are given on how per-question error probability is computed from observed outcomes, any post-hoc choices in aggregation, or statistical significance tests for the reported correlations and stability. This weakens support for the claim that RI isolates the intended capability.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'which only require observed refusal rates' contains a subject-verb agreement error and should read 'which only requires observed refusal rates'.
  2. The manuscript would benefit from explicitly listing the exact prompts and decoding parameters used in each of the two evaluation passes to support reproducibility of the RI computation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of the Refusal Index (RI) definition and experimental reporting that we will address to strengthen the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§3] §3 (RI Definition and two-pass method): The central claim that RI 'accurately quantifies' knowledge-aware refusal depends on error probability from the forced-answer pass serving as a valid proxy for the model's knowledge state in the refusal-allowed setting. The manuscript provides no verification or controls for prompt sensitivity, decoding differences, or calibration shifts between passes; without this, the per-question pairing may misalign and the correlation may reflect behavioral artifacts rather than knowledge awareness.

    Authors: We agree that the validity of the forced-answer pass as a proxy for knowledge state is a key assumption. The two-pass design uses identical base prompts and decoding parameters across passes, differing only in the explicit instruction to answer (versus allowing refusal), which is intended to surface the model's underlying knowledge without introducing new behavioral biases. The observed stability of RI rankings across 16 models and 5 datasets provides indirect empirical support that the metric is not dominated by pass-specific artifacts. In revision we will add an explicit discussion of this assumption in §3, including a small-scale sensitivity analysis on prompt phrasing for the forced-answer pass on one dataset to quantify any shifts. revision: partial

  2. Referee: [§4] §4 (Experiments and stability claims): The abstract and results assert stability across refusal rates and consistent rankings independent of accuracy, yet no details are given on how per-question error probability is computed from observed outcomes, any post-hoc choices in aggregation, or statistical significance tests for the reported correlations and stability. This weakens support for the claim that RI isolates the intended capability.

    Authors: We will expand §4 to include the precise computation: per-question error probability is the fraction of incorrect answers obtained in the forced-answer pass (averaged over 3 independent generations per question to reduce sampling noise). No post-hoc filtering or aggregation choices beyond direct use of observed rates are applied. We will also report bootstrap-derived 95% confidence intervals on the Spearman correlations and on the stability of model rankings across refusal-rate strata to substantiate the independence claims. revision: yes
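
The computation promised in this response can be sketched as follows, under stated assumptions. Per-question error probability is taken as the fraction of incorrect forced-answer generations (three per question, as the response states), and a percentile bootstrap over questions gives a 95% interval for the Spearman correlation. All function names, the resample count, the seed, and the sample data are hypothetical; the authors' actual procedure may differ.

```python
import random
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, or None when either sample is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def error_probability(is_correct):
    """Fraction of incorrect answers among the generations for one question."""
    return sum(1 for ok in is_correct if not ok) / len(is_correct)


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's rho, resampling questions."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with constant values
            stats.append(rho)
    if not stats:
        raise ValueError("every resample was degenerate")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical data: three forced-answer generations per question.
    forced = [[False, False, False], [True, True, True], [True, False, False],
              [True, True, False], [True, True, True], [False, False, True]]
    refusal = [0.8, 0.1, 0.5, 0.4, 0.0, 0.7]  # hypothetical refusal rates
    errors = [error_probability(g) for g in forced]
    print("rho:", spearman(refusal, errors))
    print("95% CI:", bootstrap_ci(refusal, errors))
```

With only three generations per question the error estimate takes just four distinct values, so ties are common; that is why the sketch uses average ranks rather than the no-ties shortcut formula.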

Circularity Check

0 steps flagged

No significant circularity: RI is a direct empirical correlation metric computed from observed data

Full rationale

The paper defines RI explicitly as Spearman's rank correlation between per-question refusal probability and error probability, measured via a two-pass evaluation on standard runs (one observing refusals, one deriving errors against ground truth). This is a straightforward statistical computation from independent observations rather than a derivation that reduces to fitted parameters, self-referential definitions, or self-citation chains. No equations in the provided text show the metric being constructed from itself or renamed known results; claims of stability and consistent rankings are presented as empirical findings from experiments across models and datasets, not forced by the definition. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard statistical correlation and the assumption that two evaluation passes suffice to estimate both refusal and error probabilities. No free parameters are fitted to data. The main invented element is the RI metric itself, which lacks independent external validation beyond the reported experiments.

axioms (1)
  • [standard math] Spearman's rank correlation coefficient measures the monotonic relationship between two ranked variables
    Directly used to define RI from refusal and error probabilities.
invented entities (1)
  • Refusal Index (RI) [no independent evidence]
    purpose: Quantify knowledge-aware refusal capability of LLMs
    Newly defined metric whose validity is supported only by the paper's internal experiments.

