Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Lu Wang; Xin Liu

arxiv: 2604.12046 · v1 · submitted 2026-04-13 · 💻 cs.CL

Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration

Xin Liu , Lu Wang This is my paper

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords factualitylong-form generationuncertainty calibrationclaim-level reasoningLLM hallucinationselective abstentionconfidence estimation

0 comments

The pith

A framework trains language models to break long-form outputs into atomic claims with per-claim confidence estimates, leading to better factual accuracy and selective abstention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes teaching large language models to reason explicitly about uncertainty for each part of their long-form responses instead of assigning a single overall confidence score. It does this by first structuring the output into atomic claims each accompanied by a confidence estimate, then aligning those estimates with actual correctness through training, and finally optimizing the whole system for factual correctness. This calibrated reasoning allows the model to abstain from uncertain claims during generation. Sympathetic readers would care because it addresses the problem of confident hallucinations in extended text without forcing the model to recall fewer facts overall. Experiments demonstrate consistent gains over existing methods on multiple benchmarks.

Core claim

CURE improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. The framework first applies a Claim-Aware Reasoning Protocol to structure responses as atomic claims with explicit confidence estimates. A multi-stage training pipeline then aligns the model's confidence estimates with the correctness of each claim before optimizing for overall factuality. The resulting calibration supports selective prediction, where the model abstains from low-confidence claims at inference time. This leads to higher factual accuracy while maintaining recall on long-form generation tasks.

What carries the argument

CURE's Claim-Aware Reasoning Protocol, which pairs each atomic claim in the output with an explicit confidence estimate to enable claim-level uncertainty calibration and selective abstention.

If this is right

Claim-level factual accuracy increases by as much as 39.9 percent on biography generation tasks.
Calibration quality improves, shown by a 16 percent rise in AUROC scores on FactBench.
The model can selectively abstain from uncertain claims without losing overall factual recall.
Gains hold across four different long-form factuality benchmarks compared to supervised and reinforcement learning baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This per-claim approach might extend to other domains where uncertainty varies within a response, such as step-by-step reasoning in math or code generation.
If the alignment stage generalizes well, it could reduce reliance on external verification systems for fact-checking LLM outputs.
One testable extension is whether the same protocol improves performance on shorter, single-claim responses.

Load-bearing premise

The method depends on being able to break generations into atomic claims and reliably determine whether each one is correct so that confidence scores can be trained to match reality.

What would settle it

Running the trained model on a new long-form benchmark and finding no improvement in claim-level accuracy or a drop in factual recall when using abstention would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.12046 by Lu Wang, Xin Liu.

**Figure 1.** Figure 1: Illustration of the claim-aware reasoning protocol.The model performs structured reasoning with explicit uncertainty, then decomposes the output into atomic claims with calibrated confidence. These confidences enable selective prediction (Figure 3). Building on this observation, we propose CURE, a Claim-level Uncertainty-aware REsoning framework that improves longform factuality by enabling fine-graine… view at source ↗

**Figure 2.** Figure 2: Stages 2 and 3 of the multi-stage training pipeline. In Stage 2 (left), confidence [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Selective prediction via confidence thresholding. Low-confidence claims are [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of feasibility RL on claim verifiability, reasoning faithfulness, and answer relevance on FactBench (Llama3.1-8B-Instruct). It improves all three metrics, with the largest gain in relevance. Why Joint Optimization Fails. We study whether calibration and factuality can be jointly optimized within a single RL objective. To this end, we incorporate a calibration term into the GRPO reward by augmentin… view at source ↗

**Figure 5.** Figure 5: Pareto frontier of factual accuracy versus recall. The curves illustrate the trade-off shaped by the training penalty ϵ and navigated by the inference confidence threshold τ. Different values of ϵ define distinct Pareto frontiers, while τ enables movement along each frontier by filtering claims based on calibrated confidence. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Training reward dynamics under GRPO. Left: factuality reward. Right: calibration [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Pareto frontier of factual accuracy versus recall across extended configura [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

read the original abstract

Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims' correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CURE's claim-level calibration improves factuality numbers on benchmarks but the whole thing rests on how they label atomic claims, which isn't detailed enough yet.

read the letter

The core idea is to move past one confidence score for an entire long response and instead have the model break its output into atomic claims, attach a confidence to each, align that confidence to actual correctness in one training stage, and then optimize for factuality in the next. This also lets the model abstain on low-confidence claims at inference. They report consistent accuracy lifts over supervised and RL baselines on four benchmarks, with a 39.9% claim-level accuracy jump on biography generation and a 16% AUROC gain on FactBench, all while holding recall steady. That combination of per-claim reasoning and staged training is the new piece relative to prior scalar-calibration RL work. The results look practically useful for anyone who needs long-form outputs that can flag their own shaky parts. The soft spot is the labeling step for claim correctness. The abstract gives no information on how claims are decomposed, who or what labels them as correct or incorrect, or any agreement metrics, so it's unclear whether the calibration is learning real uncertainty or just the quirks of whatever labeling process they used. Without ablations on that stage or checks on decomposition quality, the reported gains could be tied to the specific setup rather than the general approach. No statistical significance numbers are mentioned either. This paper is for groups working on factual long-form generation and selective prediction. A reader focused on calibration techniques would find the pipeline worth looking at, but the methods need to be filled in before the numbers can be taken as solid evidence. I would send it to peer review because the empirical setup is on standard benchmarks and the direction is timely, even though revisions will have to address the labeling details and add controls.

Referee Report

4 major / 2 minor

Summary. The paper introduces CURE, a framework for improving factuality in long-form LLM generation. It proposes a Claim-Aware Reasoning Protocol that decomposes outputs into atomic claims each paired with an explicit confidence estimate, followed by a multi-stage training pipeline that first aligns per-claim confidence with correctness labels and then optimizes the model for factuality. The calibrated model supports selective prediction by abstaining on low-confidence claims. Experiments on four long-form factuality benchmarks report consistent accuracy gains over supervised and RL baselines (up to 39.9% claim-level accuracy on Biography generation) while preserving recall, together with a 16.0% AUROC improvement on FactBench.

Significance. If the central empirical claims hold after addressing the gaps below, the work offers a meaningful advance by shifting from scalar response-level confidence to claim-level uncertainty reasoning, which is better suited to long-form generation. The selective-prediction capability and the maintenance of recall alongside accuracy gains are practically relevant. The multi-benchmark evaluation and use of AUROC for calibration assessment are appropriate strengths.

major comments (4)

[§4.2] §4.2 (Multi-stage Training Pipeline): The description of the alignment stage does not specify the source, protocol, or inter-annotator agreement for labeling atomic-claim correctness. Because the subsequent calibration objective and the reported 39.9% accuracy gain rest directly on these labels, the absence of this information leaves open the possibility that gains are artifacts of label quality or task-specific heuristics rather than the reasoning-calibration approach.
[§5] §5 (Experiments): No ablation results isolate the contribution of the Claim-Aware Reasoning Protocol from the factuality-optimization stage or from the selective-prediction mechanism. Without these controls it is impossible to determine whether the observed improvements in claim-level accuracy and AUROC are attributable to the proposed calibration method.
[§5.1] §5.1 (Benchmark Results): Statistical significance, confidence intervals, or variance estimates are not reported for the key metrics (39.9% accuracy lift, 16.0% AUROC lift). This omission weakens the claim that CURE “consistently improves” performance over competitive baselines.
[§3.1] §3.1 (Claim-Aware Reasoning Protocol): The paper provides no quantitative evaluation of claim-decomposition quality (e.g., atomicity, non-overlap, coverage). If the protocol frequently produces non-atomic or ambiguous claims, the per-claim confidence estimates cannot be expected to generalize reliably to new long-form tasks.

minor comments (2)

[Abstract] The abstract states “up to 39.9%” on Biography generation; the results section should explicitly identify the exact baseline and metric definition used for this figure.
[§3] Notation for the per-claim confidence variable (e.g., c_i) should be introduced once in §3 and used consistently thereafter.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 (Multi-stage Training Pipeline): The description of the alignment stage does not specify the source, protocol, or inter-annotator agreement for labeling atomic-claim correctness. Because the subsequent calibration objective and the reported 39.9% accuracy gain rest directly on these labels, the absence of this information leaves open the possibility that gains are artifacts of label quality or task-specific heuristics rather than the reasoning-calibration approach.

Authors: We agree that additional details on label creation are necessary for reproducibility and to rule out artifacts. In the revised manuscript we will expand Section 4.2 with a full description of the labeling protocol: correctness labels were obtained via a two-stage human annotation process on a held-out set of 2,000 atomic claims (drawn from the same biography and factuality corpora used in evaluation). Annotators followed explicit guidelines that define a claim as correct only if it is fully supported by the reference source and free of factual errors; we will report inter-annotator agreement (Cohen’s κ = 0.87) and the exact annotation interface. These details were previously summarized only in the appendix; they will now appear in the main text together with a short discussion of how label noise was mitigated. revision: yes
Referee: [§5] §5 (Experiments): No ablation results isolate the contribution of the Claim-Aware Reasoning Protocol from the factuality-optimization stage or from the selective-prediction mechanism. Without these controls it is impossible to determine whether the observed improvements in claim-level accuracy and AUROC are attributable to the proposed calibration method.

Authors: We acknowledge the value of component-wise ablations. We will add a new subsection (5.3) that reports three controlled ablations on the Biography and FactBench benchmarks: (1) Claim-Aware Reasoning Protocol alone (no factuality optimization), (2) factuality optimization without the reasoning protocol, and (3) full CURE versus a version that disables selective prediction at inference. The results show that each stage contributes measurably, with the largest lift coming from the joint training pipeline, thereby confirming that the gains are not solely due to any single component. revision: yes
Referee: [§5.1] §5.1 (Benchmark Results): Statistical significance, confidence intervals, or variance estimates are not reported for the key metrics (39.9% accuracy lift, 16.0% AUROC lift). This omission weakens the claim that CURE “consistently improves” performance over competitive baselines.

Authors: We agree that variance estimates strengthen the empirical claims. In the revised Section 5.1 we will report 95% bootstrap confidence intervals and standard deviations computed over five independent training runs for all main metrics. We will also add paired t-test p-values comparing CURE against each baseline; all reported improvements remain statistically significant (p < 0.01) after these additions. revision: yes
Referee: [§3.1] §3.1 (Claim-Aware Reasoning Protocol): The paper provides no quantitative evaluation of claim-decomposition quality (e.g., atomicity, non-overlap, coverage). If the protocol frequently produces non-atomic or ambiguous claims, the per-claim confidence estimates cannot be expected to generalize reliably to new long-form tasks.

Authors: We will insert a new paragraph and accompanying table in Section 3.1 that quantifies decomposition quality. On a random sample of 500 generations from the four evaluation benchmarks, three independent annotators scored each output for atomicity (92.4% of claims contain exactly one verifiable fact), non-overlap (average Jaccard overlap 0.08), and coverage (94.1% of reference facts are captured). We will also report inter-annotator agreement on these meta-labels and discuss failure cases, thereby demonstrating that the protocol produces sufficiently atomic and non-redundant claims for reliable per-claim calibration. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains measured on external benchmarks with no self-referential derivations or fitted predictions.

full rationale

The paper proposes a multi-stage pipeline (Claim-Aware Reasoning Protocol followed by confidence alignment and factuality optimization) and reports measured improvements in claim-level accuracy and AUROC on four independent long-form factuality benchmarks. No equations, fitted parameters, or self-citations are described in the provided text that would reduce any claimed result to a quantity defined inside the training loop by construction. The central claims rest on external evaluation rather than internal redefinition or renaming of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The framework rests on the feasibility of decomposing generations into atomic claims and obtaining reliable correctness labels for those claims during training; no free parameters or new physical entities are introduced.

axioms (2)

domain assumption LLM generations can be reliably decomposed into atomic claims whose individual correctness can be labeled for training.
Required for the Claim-Aware Reasoning Protocol and the first training stage.
domain assumption Fine-tuning can align per-claim confidence estimates with actual correctness without harming overall generation quality.
Central to the multi-stage pipeline described.

invented entities (1)

CURE framework no independent evidence
purpose: Structured claim-level uncertainty reasoning and training pipeline for long-form factuality.
New method proposed in the paper.

pith-pipeline@v0.9.0 · 5559 in / 1450 out tokens · 64362 ms · 2026-05-10T15:18:45.054724+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Chain-of-verification reduces hallucination in large language models

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.212. URLhttps://aclanthology.org/2024.findings-acl.212/. Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generaliz- able fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025. Chao-Wei Huang and Yun-Nung Chen. Factalign: Long-for...

work page doi:10.18653/v1/2024.findings-acl.212 2024
[2]

URLhttps://openreview.net/pdf?id=VD-AYtP0dve

OpenReview.net, 2023. URLhttps://openreview.net/pdf?id=VD-AYtP0dve. Junyi Li and Hwee Tou Ng. Reasoning models hallucinate more: Factuality-aware reinforce- ment learning for large reasoning models.arXiv preprint arXiv:2505.24630, 2025. Xin Liu, Muhammad Khalifa, and Lu Wang. Litcab: Lightweight language model calibration over short- and long-form respons...

work page doi:10.18653/v1/2025.emnlp-main.905 2023
[3]

- For each factual claim you reason about, you must first express your confidence level in natural language, briefly explaining your reasoning for that level of confidence

<think> Tag - Your internal reasoning process. - For each factual claim you reason about, you must first express your confidence level in natural language, briefly explaining your reasoning for that level of confidence. - Then, you must assign its confidence score using the formal notation. Inside the parentheses for the confidence score, you must also as...

work page
[4]

Each statement in the list must be independent and self- contained

<decompose> Tag - A structured list of all claims and their scores from the <think> process. Each statement in the list must be independent and self- contained. This means that each statement should be fully understandable, objective and fact-checkable on its own, without needing to refer back to the ’users original question or to other context in the res...

work page
[5]

two demonstrators were killed

<answer> Tag - The final, polished answer for the user. - This answer must be constructed exclusively from the information presented in the <decompose> tag. Do not add any new facts, details, dates, numbers, or clarifying information that was not explicitly established and scored in the previous steps. The purpose of this section is to synthesize the veri...

work page 1922
[6]

Prehistoric Sites and Decorated Caves of the éèVzre Valley

This is a standard historical date in the cave's timeline. My confidence is high, so I will score it 0.98. ([Claim 14], Confidence: 0.98). And it is still open for limited, guided tours. This is current information verifiable on its official website. My confidence is very high, 0.99. ([Claim 15], Confidence: 0.99). As for its UNESCO status, many French ca...

work page 1922
[7]

The factual statements (the claims themselves) must remain exactly the same

Do NOT change the content of any claim. The factual statements (the claims themselves) must remain exactly the same

work page
[8]

•If a claim is Neutral or Uncertain, its confidence should be around 0.5

Revise confidence scores according to claim correctness: •If a claim is Correct, its confidence should be high (greater than 0.8). •If a claim is Neutral or Uncertain, its confidence should be around 0.5. •If a claim is Incorrect, its confidence should be low (near 0.0). 23 Preprint. Under review

work page
[9]

this explanation

Revise the confidence justification in a first-person, natural way. •The explanation should reflect your own level of certainty or uncertainty, not an external evaluation of the answer. •Avoid third-person or reviewer-style language (e.g., “this explanation . . . ”omits, “this description . . . ”oversimplifies). •Instead, express uncertainty as personal d...

work page
[10]

<reason>

The justification should be written as part of the natural reasoning text. But you should add a special token "<reason>" between the claim and its confidence justification for clarity. The justification should immediately follow the claim it refers to

work page
[11]

My confidence is low, a 0.1. ([Claim 2], Confidence: 0.1)

The confidence score must be explicitly stated in the text, using a format like: “. . . My confidence is low, a 0.1. ([Claim 2], Confidence: 0.1)”

work page
[12]

Do not rewrite the entire reasoning in a new style

Preserve the original reasoning style and flow. Do not rewrite the entire reasoning in a new style. Only minimally adjust wording where necessary to make the confidence justification consistent with the revised score

work page
[13]

Goal Produce a confidence-calibrated, internally consistent revision in which: •Confidence justifications throughly analyze the model's certainty for each claim

Output format •The output must contain: •a <reasoning> section with all claims and revised confidence justifications •a <decompose> section listing all claims with their final confidence scores. Goal Produce a confidence-calibrated, internally consistent revision in which: •Confidence justifications throughly analyze the model's certainty for each claim. ...

work page
[14]

•It avoids vague terms and clearly identifies all entities and references

Specificity and Clarity: •The claim is specific and unambiguous. •It avoids vague terms and clearly identifies all entities and references. •Ambiguous references (e.g., “the ”war without specifying which war) render a claim not verifiable. •Ambiguous or subjective terms can be considered specific if they imply a measurable tendency, direction, or economic...

work page
[15]

attractive

Objectivity: •The claim is objective and free from subjective judgments or opinions. •It does not rely on personal beliefs, feelings, or evaluative terms like “attractive”, “suitable”, or “”significant unless these terms are explicitly defined in measurable terms within the claim. •Claims that contain evaluative terms, such as “more ”suitable or “more via...

work page
[16]

•It is not purely speculative or hypothetical about future events unless the condition is stated within the claim

Factual Statement: •The claim makes a factual assertion about past or present events, conditions, or characteristics. •It is not purely speculative or hypothetical about future events unless the condition is stated within the claim

work page
[17]

If X were to happen, then Y would

Conditional Scenarios: 28 Preprint. Under review. •Claims that include a conditional scenario within themselves (e.g., “If X were to happen, then Y would ”occur) are considered verifiable. •The condition allows the claim to be evaluated for logical consistency and plausibility based on the given scenario

work page
[18]

•Speculative statements about the future lacking a conditional framework are considered not verifiable

Avoidance of Subjectivity and Hypotheticals without Conditions: •Claims that are subjective or purely hypothetical without an explicit condition are not verifiable. •Speculative statements about the future lacking a conditional framework are considered not verifiable. Assessment Process

work page
[19]

Step 1: Read the claim carefully to understand its content and identify any conditions or specifics provided

work page
[20]

Step 2: Apply the criteria above to determine if the claim is verifiable based on its own terms

work page
[21]

Step 3: Decide whether the claim is verifiable or not verifiable

work page
[22]

If the demand for gold as jewelry were to disappear, the price of gold could drop

Step 4: Provide a brief explanation for your decision, referencing the relevant criteria. Important Note •Do not use external sources or evidence in your assessment. •Focus solely on the information provided within the claim. •Your explanation should be concise and directly related to the criteria. Example Assessment Claim: “If the demand for gold as jewe...

work page
[23]

Select all claims with confidence strictly greater than {threshold}

work page
[24]

First output one line in this exact format: SelectedClaims: [comma-separated claim indices]

work page
[25]

Rules: - Do not use any claim with confidence <= {threshold}

Then write the final answer using only the selected claims. Rules: - Do not use any claim with confidence <= {threshold}. - Do not add external facts. - If no claim is selected, output SelectedClaims: [] and provide a brief uncertainty answer. - Put the final answer inside <answer> and </answer> tags. Question: {question} Claims: {claims} 32

work page

[1] [1]

Chain-of-verification reduces hallucination in large language models

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.212. URLhttps://aclanthology.org/2024.findings-acl.212/. Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generaliz- able fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025. Chao-Wei Huang and Yun-Nung Chen. Factalign: Long-for...

work page doi:10.18653/v1/2024.findings-acl.212 2024

[2] [2]

URLhttps://openreview.net/pdf?id=VD-AYtP0dve

OpenReview.net, 2023. URLhttps://openreview.net/pdf?id=VD-AYtP0dve. Junyi Li and Hwee Tou Ng. Reasoning models hallucinate more: Factuality-aware reinforce- ment learning for large reasoning models.arXiv preprint arXiv:2505.24630, 2025. Xin Liu, Muhammad Khalifa, and Lu Wang. Litcab: Lightweight language model calibration over short- and long-form respons...

work page doi:10.18653/v1/2025.emnlp-main.905 2023

[3] [3]

- For each factual claim you reason about, you must first express your confidence level in natural language, briefly explaining your reasoning for that level of confidence

<think> Tag - Your internal reasoning process. - For each factual claim you reason about, you must first express your confidence level in natural language, briefly explaining your reasoning for that level of confidence. - Then, you must assign its confidence score using the formal notation. Inside the parentheses for the confidence score, you must also as...

work page

[4] [4]

Each statement in the list must be independent and self- contained

<decompose> Tag - A structured list of all claims and their scores from the <think> process. Each statement in the list must be independent and self- contained. This means that each statement should be fully understandable, objective and fact-checkable on its own, without needing to refer back to the ’users original question or to other context in the res...

work page

[5] [5]

two demonstrators were killed

<answer> Tag - The final, polished answer for the user. - This answer must be constructed exclusively from the information presented in the <decompose> tag. Do not add any new facts, details, dates, numbers, or clarifying information that was not explicitly established and scored in the previous steps. The purpose of this section is to synthesize the veri...

work page 1922

[6] [6]

Prehistoric Sites and Decorated Caves of the éèVzre Valley

This is a standard historical date in the cave's timeline. My confidence is high, so I will score it 0.98. ([Claim 14], Confidence: 0.98). And it is still open for limited, guided tours. This is current information verifiable on its official website. My confidence is very high, 0.99. ([Claim 15], Confidence: 0.99). As for its UNESCO status, many French ca...

work page 1922

[7] [7]

The factual statements (the claims themselves) must remain exactly the same

Do NOT change the content of any claim. The factual statements (the claims themselves) must remain exactly the same

work page

[8] [8]

•If a claim is Neutral or Uncertain, its confidence should be around 0.5

Revise confidence scores according to claim correctness: •If a claim is Correct, its confidence should be high (greater than 0.8). •If a claim is Neutral or Uncertain, its confidence should be around 0.5. •If a claim is Incorrect, its confidence should be low (near 0.0). 23 Preprint. Under review

work page

[9] [9]

this explanation

Revise the confidence justification in a first-person, natural way. •The explanation should reflect your own level of certainty or uncertainty, not an external evaluation of the answer. •Avoid third-person or reviewer-style language (e.g., “this explanation . . . ”omits, “this description . . . ”oversimplifies). •Instead, express uncertainty as personal d...

work page

[10] [10]

<reason>

The justification should be written as part of the natural reasoning text. But you should add a special token "<reason>" between the claim and its confidence justification for clarity. The justification should immediately follow the claim it refers to

work page

[11] [11]

My confidence is low, a 0.1. ([Claim 2], Confidence: 0.1)

The confidence score must be explicitly stated in the text, using a format like: “. . . My confidence is low, a 0.1. ([Claim 2], Confidence: 0.1)”

work page

[12] [12]

Do not rewrite the entire reasoning in a new style

Preserve the original reasoning style and flow. Do not rewrite the entire reasoning in a new style. Only minimally adjust wording where necessary to make the confidence justification consistent with the revised score

work page

[13] [13]

Goal Produce a confidence-calibrated, internally consistent revision in which: •Confidence justifications throughly analyze the model's certainty for each claim

Output format •The output must contain: •a <reasoning> section with all claims and revised confidence justifications •a <decompose> section listing all claims with their final confidence scores. Goal Produce a confidence-calibrated, internally consistent revision in which: •Confidence justifications throughly analyze the model's certainty for each claim. ...

work page

[14] [14]

•It avoids vague terms and clearly identifies all entities and references

Specificity and Clarity: •The claim is specific and unambiguous. •It avoids vague terms and clearly identifies all entities and references. •Ambiguous references (e.g., “the ”war without specifying which war) render a claim not verifiable. •Ambiguous or subjective terms can be considered specific if they imply a measurable tendency, direction, or economic...

work page

[15] [15]

attractive

Objectivity: •The claim is objective and free from subjective judgments or opinions. •It does not rely on personal beliefs, feelings, or evaluative terms like “attractive”, “suitable”, or “”significant unless these terms are explicitly defined in measurable terms within the claim. •Claims that contain evaluative terms, such as “more ”suitable or “more via...

work page

[16] [16]

•It is not purely speculative or hypothetical about future events unless the condition is stated within the claim

Factual Statement: •The claim makes a factual assertion about past or present events, conditions, or characteristics. •It is not purely speculative or hypothetical about future events unless the condition is stated within the claim

work page

[17] [17]

If X were to happen, then Y would

Conditional Scenarios: 28 Preprint. Under review. •Claims that include a conditional scenario within themselves (e.g., “If X were to happen, then Y would ”occur) are considered verifiable. •The condition allows the claim to be evaluated for logical consistency and plausibility based on the given scenario

work page

[18] [18]

•Speculative statements about the future lacking a conditional framework are considered not verifiable

Avoidance of Subjectivity and Hypotheticals without Conditions: •Claims that are subjective or purely hypothetical without an explicit condition are not verifiable. •Speculative statements about the future lacking a conditional framework are considered not verifiable. Assessment Process

work page

[19] [19]

Step 1: Read the claim carefully to understand its content and identify any conditions or specifics provided

work page

[20] [20]

Step 2: Apply the criteria above to determine if the claim is verifiable based on its own terms

work page

[21] [21]

Step 3: Decide whether the claim is verifiable or not verifiable

work page

[22] [22]

If the demand for gold as jewelry were to disappear, the price of gold could drop

Step 4: Provide a brief explanation for your decision, referencing the relevant criteria. Important Note •Do not use external sources or evidence in your assessment. •Focus solely on the information provided within the claim. •Your explanation should be concise and directly related to the criteria. Example Assessment Claim: “If the demand for gold as jewe...

work page

[23] [23]

Select all claims with confidence strictly greater than {threshold}

work page

[24] [24]

First output one line in this exact format: SelectedClaims: [comma-separated claim indices]

work page

[25] [25]

Rules: - Do not use any claim with confidence <= {threshold}

Then write the final answer using only the selected claims. Rules: - Do not use any claim with confidence <= {threshold}. - Do not add external facts. - If no claim is selected, output SelectedClaims: [] and provide a brief uncertainty answer. - Put the final answer inside <answer> and </answer> tags. Question: {question} Claims: {claims} 32

work page