Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

Fu Lee Wang; Qing Li; Wenqi Fan; Yanghui Rao; Yuyin Lu; Ziran Liang

arxiv: 2601.11956 · v2 · pith:S4NH5EEInew · submitted 2026-01-17 · 💻 cs.CL · cs.AI

Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li

Pith reviewed 2026-05-21 15:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords large language models · knowledge graphs · calibration · hallucination · black-box models · reasoning confidence · evidence uncertainty · reliable predictions

The pith

Double-calibration with a proxy model raises accuracy and calibration in black-box LLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DoublyCal to reduce hallucinations in LLMs when handling knowledge-intensive questions. It applies a double-calibration principle: a lightweight proxy model first produces knowledge graph evidence together with a calibrated confidence score for that evidence. This evidence then steers a black-box LLM, so the final answers gain accuracy while their reported confidence becomes traceable to the uncertainty in the evidence. A sympathetic reader cares because current LLMs often output high-confidence errors, and this method promises better reliability without large extra computation.

Core claim

DoublyCal is built on a novel double-calibration principle. It employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence.

What carries the argument

Double-calibration principle: calibrating evidence confidence in a proxy model before using that signal to calibrate the reasoning confidence of the target black-box LLM.
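A minimal Python sketch of this two-stage flow, under stated assumptions. The names `Evidence`, `proxy_retrieve`, and `build_prompt`, the prompt wording, and the example triple and score are all hypothetical and do not come from the paper. The proxy is stubbed with fixed output and the call to the black-box LLM is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Evidence:
    """KG evidence plus the proxy's calibrated confidence (hypothetical container)."""
    triples: List[Tuple[str, str, str]]
    confidence: float  # calibrated probability that the evidence is reliable


def proxy_retrieve(question: str) -> Evidence:
    """Stand-in for the lightweight proxy model; returns fixed illustrative output."""
    return Evidence(
        triples=[("Marie Curie", "born_in", "Warsaw")],
        confidence=0.82,
    )


def build_prompt(question: str, evidence: Evidence) -> str:
    """Stage 2 input: condition the black-box LLM on the evidence and its confidence."""
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in evidence.triples)
    return (
        f"Question: {question}\n"
        f"Evidence (confidence {evidence.confidence:.2f}):\n{facts}\n"
        "Answer the question and state your confidence between 0 and 1."
    )


question = "Where was Marie Curie born?"
prompt = build_prompt(question, proxy_retrieve(question))
print(prompt)
```

The point of the sketch is only the data flow: the evidence confidence travels with the evidence into the prompt, so the final reported confidence can be related back to it.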

If this is right

Black-box LLMs achieve higher accuracy on knowledge-intensive benchmarks.
The confidence scores of the LLMs become significantly better calibrated.
Token cost remains low compared with other augmentation approaches.
Final confidence scores can be directly traced to uncertainty in the supporting KG evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy-based calibration step could be tested with non-KG evidence sources such as retrieved documents.
Delegating calibration to a small proxy may reduce the need to fine-tune the largest LLMs for reliability.
Repeated application across successive queries could create running reliability estimates for deployed models.

Load-bearing premise

The calibrated confidence scores produced by the lightweight proxy model transfer effectively to guide the black-box LLM without introducing substantial new errors or calibration mismatches.

What would settle it

On the same knowledge-intensive benchmarks, replacing the proxy's calibrated evidence with uncalibrated evidence and observing no drop in accuracy or calibration metrics would show the double-calibration step adds no value.
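Running that comparison requires a calibration metric. This page does not say which one the paper reports, so the sketch below assumes expected calibration error (ECE) with equal-width bins, a common choice. The function name and the toy inputs are illustrative, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Toy inputs (made up): the same answers, half of them correct,
# scored once with overconfident and once with matched confidences.
correct = [1, 0, 1, 0]
ece_overconfident = expected_calibration_error([0.95] * 4, correct)
ece_matched = expected_calibration_error([0.5] * 4, correct)
print(ece_overconfident, ece_matched)
```

In the ablation described above, the same function would be applied to the final confidences obtained with calibrated evidence and with uncalibrated evidence; a lower value indicates better calibration.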

Original abstract

Reliable reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs while maintaining low token cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DoublyCal's proxy-based double calibration is a straightforward way to link KG evidence uncertainty to LLM outputs, but the transfer step and experimental details need closer checking to confirm the gains.

The main point is that this paper puts forward DoublyCal, where a lightweight proxy model first pulls knowledge graph evidence and attaches its own calibrated confidence score, then passes that to a black-box LLM to produce answers that are more accurate and better calibrated overall. The traceable link from evidence uncertainty to final confidence is the central new piece, and the low token cost keeps it practical for larger models. Experiments on knowledge benchmarks are said to show clear improvements in both accuracy and calibration metrics. That framing addresses a real gap, since most KG augmentation work focuses on facts but leaves uncertainty unhandled at both stages.

The approach earns credit for keeping the method simple enough to apply without heavy changes to the main LLM. The logic is consistent on paper and avoids obvious circularity in how calibration is described.

The soft spots sit in the proxy transfer and the strength of the supporting evidence. It is not guaranteed that scores from the small model will map cleanly onto the larger model's output distribution without fresh mismatches or overconfidence creeping in. The abstract highlights significant gains, yet without full details on exact baselines, how calibration error is quantified, ablation tests that separate the confidence component from plain retrieval, or any error analysis, it is hard to judge whether the results hold up broadly or depend on particular dataset quirks. If those sections are thin, the central claim loses some weight.

This paper is for people building or studying reliable LLM systems that rely on external knowledge, especially those already working on calibration or retrieval methods. A reader who wants concrete ideas for traceable uncertainty in augmented generation could pick up useful elements to test. It deserves peer review because the problem is practical and the proposal is specific enough for referees to evaluate the methods and results directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces DoublyCal, a double-calibration framework for reliable LLM reasoning on knowledge-intensive tasks. A lightweight proxy model first generates KG evidence together with a calibrated evidence confidence score; this evidence and score then condition a black-box LLM to produce final predictions that are claimed to be both more accurate and better calibrated, with confidence traceable to evidence uncertainty, all while keeping token cost low.

Significance. If the experimental claims hold, the approach offers an efficient route to quantify epistemic uncertainty in both retrieved evidence and LLM reasoning steps, extending standard KG-augmentation methods. The proxy-based design could reduce hallucinations without requiring white-box access or high inference cost.

major comments (2)

[§4] Experimental Setup: the abstract and results claim 'significant improvements' in both accuracy and calibration, yet no specific baselines (e.g., vanilla RAG, temperature scaling, or other KG-augmented methods), no statistical significance tests, and no error bars or ablation isolating the calibration component versus plain evidence augmentation are described; this leaves the central empirical claim unsupported.
[§3.2] Proxy-to-LLM Transfer: the weakest assumption, that calibrated confidence scores from the proxy transfer to the black-box LLM without introducing new calibration mismatches, is not tested; an ablation comparing (evidence + raw score) versus (evidence + calibrated score) is required to establish that the double-calibration step, rather than KG augmentation alone, drives the reported gains.
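For readers unfamiliar with the temperature-scaling baseline named in the first comment, here is a minimal sketch of how a raw score can be turned into a rescaled one. It is a generic post-hoc method, not the calibration procedure used by DoublyCal. The grid search stands in for the usual gradient-based fit, and the logits and labels are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def negative_log_likelihood(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / temperature) against 0/1 labels."""
    eps = 1e-12
    total = 0.0
    for logit, label in zip(logits, labels):
        p = sigmoid(logit / temperature)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(logits)


def fit_temperature(logits, labels):
    """Grid search over T from 0.1 to 10.0 in steps of 0.1."""
    grid = [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: negative_log_likelihood(logits, labels, t))


# Toy held-out set (made up): large-magnitude logits that are right 3 times in 4.
logits = [4.0, 4.0, 4.0, 4.0, -4.0, -4.0, -4.0, -4.0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
temperature = fit_temperature(logits, labels)
raw_confidence = sigmoid(4.0)
scaled_confidence = sigmoid(4.0 / temperature)
print(temperature, raw_confidence, scaled_confidence)
```

On this toy set the fitted temperature is above 1, which pulls the overconfident raw score toward the observed 75% hit rate. The requested ablation would feed the raw and the rescaled score to the black-box LLM in turn and compare the resulting accuracy and calibration.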

minor comments (2)

[Notation] Notation for the two confidence quantities (evidence confidence vs. final reasoning confidence) should be made consistent across equations and figures.
[Results] Figure 2 or the corresponding results table should report token counts explicitly for all compared methods to substantiate the 'low token cost' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical validation of DoublyCal. We address each major comment below and commit to revisions that will better support our claims without misrepresenting the current results.

Referee: [§4] Experimental Setup: the abstract and results claim 'significant improvements' in both accuracy and calibration, yet no specific baselines (e.g., vanilla RAG, temperature scaling, or other KG-augmented methods), no statistical significance tests, and no error bars or ablation isolating the calibration component versus plain evidence augmentation are described; this leaves the central empirical claim unsupported.

Authors: We agree that the experimental section would benefit from additional rigor to substantiate the claims of significant improvements. In the revised manuscript, we will expand §4 to include explicit comparisons against vanilla RAG, temperature scaling, and other relevant KG-augmented baselines. We will also report results with error bars from multiple random seeds, conduct statistical significance tests (such as paired t-tests with p-values), and add an ablation isolating the calibration component from plain evidence augmentation. These changes will provide clearer evidence for the reported gains in accuracy and calibration.
revision: yes
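The paired comparison across seeds mentioned in this response can be sketched with the standard library alone. The function name and the per-seed accuracies are hypothetical and are not results from the paper. The sketch stops at the t statistic because the p-value needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, mean difference, standard error) for paired samples.

    Degrees of freedom are len(scores_a) - 1.
    """
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(differences)
    std_diff = statistics.stdev(differences)  # sample standard deviation (n - 1)
    standard_error = std_diff / math.sqrt(len(differences))
    return mean_diff / standard_error, mean_diff, standard_error


# Made-up accuracies for two systems evaluated on the same five seeds.
system_a = [0.71, 0.73, 0.72, 0.74, 0.70]
system_b = [0.68, 0.69, 0.70, 0.69, 0.67]
t_stat, mean_diff, standard_error = paired_t_statistic(system_a, system_b)
print(t_stat, mean_diff, standard_error)
```

With five seeds there are 4 degrees of freedom, so a two-sided test at the 0.05 level rejects when the absolute t statistic exceeds about 2.776. The standard error also gives the error bar on the mean difference.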

Referee: [§3.2] Proxy-to-LLM Transfer: the weakest assumption, that calibrated confidence scores from the proxy transfer to the black-box LLM without introducing new calibration mismatches, is not tested; an ablation comparing (evidence + raw score) versus (evidence + calibrated score) is required to establish that the double-calibration step, rather than KG augmentation alone, drives the reported gains.

Authors: We recognize that directly testing the value of the calibrated scores over raw scores is necessary to isolate the effect of double-calibration. In the revision, we will add the requested ablation in §3.2 (or a new subsection) comparing (evidence + raw proxy score) against (evidence + calibrated score) when conditioning the black-box LLM. This will help confirm that the calibration step contributes to the improvements beyond standard KG augmentation and will address potential mismatches in score transfer.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces DoublyCal as a methodological framework in which a lightweight proxy first produces KG evidence together with a calibrated confidence score that conditions a black-box LLM. The abstract and high-level description contain no equations, no fitted parameters renamed as predictions, and no self-citation chains or uniqueness theorems that reduce the central claim to its own inputs by construction. The double-calibration principle is presented as an external methodological step whose effectiveness is asserted via benchmark experiments rather than tautological re-derivation. The argument is therefore self-contained against external validation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework relies on standard KG retrieval and calibration concepts whose details are not specified here.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence... Bayesian calibration of KG evidence... Beta-Bernoulli model... p(A|zQ) = ...
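The quoted passage names a Beta-Bernoulli model but elides the paper's own expression. As background only, the textbook conjugate update looks like the sketch below. It is not a reconstruction of the paper's formula; the function names, the uniform prior, and the counts are assumptions chosen for illustration.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior Beta(1, 1); made-up counts: evidence of this kind
# supported a correct answer 8 times and failed 2 times.
alpha_post, beta_post = beta_bernoulli_update(1.0, 1.0, successes=8, failures=2)
confidence = posterior_mean(alpha_post, beta_post)
print(alpha_post, beta_post, confidence)
```

The posterior mean shrinks the raw success rate (8 of 10) toward the prior mean, which is the usual motivation for using such a model as a confidence estimate when counts are small.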

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.