DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

Ankur Padia; Francis Ferraro; Shubhashis Roy Dipta

arxiv: 2605.27858 · v1 · pith:SGHDVEULnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.LG

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

Pith reviewed 2026-06-29 13:36 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.LG

keywords: claim verification, reinforcement learning, question decomposition, semi-supervised learning, traceable reasoning, GRPO, data curation


The pith

A 7B DecomposeRL policy trained on 5K curated claims matches 32B baselines and GPT-4.1-mini on claim verification while producing inspectable traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DecomposeRL to close the gap between accurate but opaque end-to-end claim verifiers and traceable but weaker decomposition methods. It casts decomposition as an RL policy optimized via GRPO with a multi-faceted reward that favors useful, informative, and diverse questions. A curation funnel compresses 115K claims into a 5K subset that supports both full supervision and semi-supervised training from unlabeled data. The resulting 7B model reaches 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 biomedical, political, scientific, and general benchmarks.

Core claim

DecomposeRL treats claim decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble; a data-curation funnel reduces 115K fact-verification claims to a 5K subset; the resulting 7B policy achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy on 11 benchmarks, matching 32B and GPT-4.1-mini models while also outperforming baselines in a semi-supervised regime that uses only 10 percent labeled data.

What carries the argument

GRPO-trained RL policy with multi-faceted reward ensemble that learns to generate useful, informative, and diverse questions for decomposing claims.
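As a rough illustration of the GRPO step this section refers to, the sketch below shows the group-relative advantage: several rollouts are sampled for the same claim, each is scored by the reward ensemble, and each reward is normalized against its own group. The function name, the `eps` constant, and the example rewards are assumptions for illustration, and the population standard deviation is one possible choice; this is not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same claim."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one claim, each already scored by the reward ensemble.
advantages = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
```

Rollouts that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to produce a baseline.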

If this is right

Traceable decomposition becomes possible at the accuracy level of end-to-end classifiers.
Semi-supervised training works with only 10 percent labeled claims.
Model size can be reduced to 7B while still matching 32B and GPT-4.1-mini performance.
A compact curated dataset of roughly 5K examples suffices for strong generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL decomposition approach could be applied to other multi-step reasoning tasks that benefit from inspectable intermediate steps.
Generated question traces could be used directly by human fact-checkers to audit model decisions.
Varying the curation funnel's selection criteria might allow further reduction in the number of required labeled examples.
Combining the policy with larger base models could produce additional accuracy gains without retraining the full pipeline.

Load-bearing premise

The curation funnel that reduces 115K claims to 5K claims retains sufficient learning signal for both in-domain and out-of-domain generalization without overfitting to the curation criteria or the benchmarks.

What would settle it

Measuring balanced accuracy of the released 7B model on a fresh collection of claims drawn from domains absent from the original 11 benchmarks.
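For readers who want to run such a check, balanced accuracy is the mean of per-class recall. The following is a minimal sketch; the function name and the toy labels are illustrative assumptions, not part of the paper's evaluation code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example: a verifier that always answers "Supported".
gold = ["Supported"] * 8 + ["Refuted"] * 2
always_supported = ["Supported"] * 10
print(balanced_accuracy(gold, always_supported))  # 0.5, although plain accuracy is 0.8
```

The metric matters here because claim-verification benchmarks often have skewed label distributions, and plain accuracy would reward a verifier that leans toward the majority label.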

Figures

Figures reproduced from arXiv: 2605.27858 by Ankur Padia, Francis Ferraro, Shubhashis Roy Dipta.

**Figure 1.** What makes a question useful, informative, and diverse? DecomposeRL addresses this along three reward axes (full reward stack in §3.2; full trace for this claim in …

**Figure 2.** Data-curation funnel. Cumulative training-row count after each stage of the pipeline (§2). Claims with fewer than two named entities (e.g., “This is true.”) carry no learning signal and are discarded using a union of science and general-domain NER models (§B). §2.3 Difficulty-Based Filtering: Modern LLM-based fact-checkers (Tang et al., 2024) correctly verify a substantial fraction of public-benchmark claim…

**Figure 3.** The DecomposeRL reward ensemble and semi-supervised training. Given a claim c and evidence d, the policy πθ produces a trace of n question–answer cycles (qi, ai) and a verdict v. (A) Seven rewards with heterogeneous evaluators: deterministic – format Rfmt, verification Rver, question count Rqc; embedding-based – diversity Rdiv (Maximal Marginal Relevance over {qi}); LLM-as-a-judge – coverage Rcov (can the j…

**Figure 4.** Results comparing DecomposeRL with multiple baselines. Balanced accuracy (%) of DecomposeRL against 11 baselines at matched scale, split into the in-domain Avg over 9 datasets (left) and the out-of-domain Avg over CoverBench and LLM-AggreFact (right). DecomposeRL (red) is the only system dominating every prompted baseline on both panels simultaneously.

**Figure 5.** DecomposeRL vs. the best baseline at each scale. In-domain Avg (left) and out-of-domain Avg (right) balanced accuracy for the strongest prompted baseline at each scale (3B, 7B, 14B, 32B) and a proprietary frontier baseline. DecomposeRL at 7B matches the 32B baseline and frontier on in-domain Avg and trails the larger baselines by less than 4 points on out-of-domain Avg.

**Figure 6.** Selector ablation on the post-dedup pool (N=17,328 claims, budget |S|=5,000). (a) PCA density of the full pool (grey) and of each selector’s pick (red / blue / orange), with coverage of non-empty pool bins annotated. (b) Cumulative fraction of the pool whose nearest selected claim lies within cosine distance d; lower-and-leftward is better worst-case coverage. Dotted verticals mark each method’s d95%. Subm…

**Figure 7.** Supervision-rate sweep. DecomposeRL-7B accuracy as a function of the supervision rate s (fraction of training claims with a ground truth label; the remaining 1−s use the self-consistency and relative-necessity fallbacks from §3.3). In-domain Avg (left) is essentially flat – the policy can be trained with as little as 10% verdict supervision at a 1.7-point cost. LLM-AggreFact (right) is also flat: standard-…

**Figure 8.** Reward ablation (same data as …

**Figure 9.** 3B ablation. All methods evaluated with a Qwen-3B base on the same in-domain (9 datasets) and out-of-domain (2 datasets) suite. DecomposeRL-3B is the only 3B configuration that simultaneously tops both panels, beating the strongest 3B baseline by +4.7 on in-domain Avg and +0.6 on out-of-domain Avg. The reward stack therefore transfers to a smaller policy without retuning, isolating the gain from policy siz…

**Figure 10.** A representative claim from the curated training set with its silver decomposition.

**Figure 11.** Trace for the intro-teaser claim of …

**Figure 12.** Multi-hop SUPPORTED success. The model bridges the two atomic facts “Dmitrović plays for Eibar” and “Eibar plays in La Liga” through the shared entity SD Eibar; the verdict only follows from their conjunction. Claim (FEVEROUS, in-domain): “George Brown began his Liberal Party leadership in Canada on the first of July in 1867.” Evidence 1: George Brown reorganized the Clear Grit (Liberal) Party in 1857, sup…

**Figure 13.** Clean single-fact REFUTED success. The model isolates the numeric mismatch (1857 vs. 1867) with a targeted follow-up question rather than over-relying on the verdict head; the second question explicitly nails the year discrepancy.

**Figure 14.** Calibrated abstention. The model abstains on the unsupported Vervet sub-claim (Q2) rather than guessing, then routes the verdict through the answerable Tantalus sub-fact (Q3). The abstention does not break the verdict because the claim is already refutable from the answerable half, exactly the behaviour the joint-multiplicative reward is designed to elicit (§3.2). Claim (CoverBench, out-of-domain, tabular)…

**Figure 15.** Counting-style failure on out-of-domain tabular evidence. The decomposition is locally sensible (each sub-question is on-topic, and the model retrieves the full ordering in Q3), but Q2 under-counts (the four U.S. golfers are 1, 2, 3, and 5; Davis Love III at rank 5 is dropped). The model then agrees with the claim’s stated count of “three” instead of cross-checking it against the answer to Q3, yielding a …
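The Figure 7 caption mentions a self-consistency fallback for claims without a ground-truth label. The paper's exact definition lives in its §3.3 and is not reproduced on this page, so the sketch below only shows the common reading of self-consistency: take the majority verdict across sampled rollouts as a pseudo-label and keep the agreement fraction as a confidence signal. The function name and the example verdicts are assumptions for illustration.

```python
from collections import Counter


def self_consistency_label(verdicts):
    """Return (majority verdict, agreement fraction) over sampled rollouts for one claim."""
    if not verdicts:
        raise ValueError("need at least one sampled verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)


# Four hypothetical rollouts for one unlabeled claim.
label, agreement = self_consistency_label(["Refuted", "Refuted", "Supported", "Refuted"])
# label == "Refuted", agreement == 0.75
```

A training loop could use the agreement fraction to down-weight or skip claims where the rollouts disagree, though whether the paper does so is not stated here.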

Original abstract

Claim verification splits between end-to-end classifiers that are accurate but yields no inspectable traces, and decomposition-based methods produce inspectable traces but lag performance on benchmark datasets. We propose DecomposeRL an accurate claim-verifier that produce inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DecomposeRL gets competitive traceable verification numbers from a 7B RL policy on a 5K curated set, but the funnel's construction is the unexamined load-bearing step.


The main takeaway is that this paper trains a decomposition policy with GRPO and a reward ensemble, then uses a curation funnel to shrink 115K claims down to 5K for efficient training. A 7B model reaches 86.3 in-domain and 69.8 out-of-domain balanced accuracy on 11 benchmarks spanning biomedical, political, and other domains, matching 32B baselines and GPT-4.1-mini while also improving in the semi-supervised case with 10% labels. Code and models are released.

What is new is the explicit RL framing of decomposition combined with the multi-faceted rewards and the funnel that makes full supervision on a small set feasible. It directly targets the accuracy-traceability gap without obvious performance loss.

The curation funnel is the soft spot worth checking. Reducing to 5K is central to the result, yet the abstract gives only a high-level description. If the funnel implicitly selects for patterns that align with the evaluation benchmarks, both the in-domain and OOD numbers could be overstated. The reward ensemble might reinforce that alignment. The paper reports no circular fitting in the accuracies themselves, and the empirical setup looks standard.

This is for NLP groups working on explainable claim verification or RL for decomposition tasks. Readers who care about traceable outputs on fact-checking benchmarks will get concrete numbers and a working method.

It deserves peer review because the performance claims are specific, the approach is distinct from prior decomposition work, and the code release supports checking. Reviewers should focus on the funnel details and any overlap with test data.

Referee Report

1 major / 1 minor

Summary. The paper proposes DecomposeRL, which frames claim decomposition for verification as an RL policy trained via GRPO with a multi-faceted reward ensemble. It introduces a data-curation funnel to distill 115K claims into a compact 5K subset for efficient training, enabling both fully supervised and semi-supervised modes. A DecomposeRL-7B model trained on the 5K subset reports 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 benchmarks (biomedical, political, scientific, general), matching 32B baselines and GPT-4.1-mini while outperforming in semi-supervised settings with 10% labeled data. Code, data, and models are released.

Significance. If the central performance claims hold, the work would be significant for showing that compact models can achieve competitive traceable claim verification via RL and curated data, bridging the gap between accurate but opaque end-to-end classifiers and inspectable but weaker decomposition methods. The code and model release is a clear strength that supports reproducibility and further research.

major comments (1)

[Abstract] The headline result (86.3/69.8 balanced accuracy on 11 benchmarks) rests on the data-curation funnel reducing 115K claims to 5K. The manuscript provides only a high-level description of this funnel and does not demonstrate that its selection criteria avoid any information derived from the 11 evaluation benchmarks or that the retained examples preserve sufficient diversity for the reported OOD transfer. This is load-bearing for the claim that the numbers are non-circular.

minor comments (1)

[Abstract] A minor grammatical issue ('DecomposeRL an accurate claim-verifier that produce inspectable traces') should be corrected for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of rigorously documenting the data-curation process. We address the concern below and commit to a substantive revision.


Referee: [Abstract] The headline result (86.3/69.8 balanced accuracy on 11 benchmarks) rests on the data-curation funnel reducing 115K claims to 5K. The manuscript provides only a high-level description of this funnel and does not demonstrate that its selection criteria avoid any information derived from the 11 evaluation benchmarks or that the retained examples preserve sufficient diversity for the reported OOD transfer. This is load-bearing for the claim that the numbers are non-circular.

Authors: We agree that the current manuscript provides only a high-level description of the curation funnel and does not include explicit verification that the selection criteria are independent of the 11 evaluation benchmarks or quantitative evidence of retained diversity. In the revision we will (1) expand the methods section with the precise, reproducible criteria used to distill the 115K claims (including all filtering, scoring, and selection steps), (2) add an explicit statement and supporting table confirming that the curation pipeline operated exclusively on the 115K pool with no access to or leakage from any of the 11 held-out benchmarks, and (3) report diversity statistics (e.g., claim-topic distribution, length, source coverage) for the final 5K subset relative to both the original pool and the OOD evaluation sets. These additions will directly substantiate the non-circular nature of the reported numbers.

Revision: yes
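One way to make the leakage statement in point (2) checkable is a near-duplicate scan between the curated training claims and the evaluation claims. The sketch below uses token-trigram Jaccard similarity; the function names, the trigram size, and the 0.8 threshold are hypothetical choices for illustration, not the authors' procedure.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so trivial edits do not hide a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def ngrams(text, n=3):
    """Set of token n-grams; texts shorter than n tokens yield one short tuple."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def flag_overlaps(train_claims, eval_claims, threshold=0.8):
    """Return (train, eval, score) triples whose n-gram Jaccard similarity meets the threshold."""
    eval_grams = [(c, ngrams(c)) for c in eval_claims]
    flagged = []
    for t in train_claims:
        t_grams = ngrams(t)
        for e, e_grams in eval_grams:
            score = jaccard(t_grams, e_grams)
            if score >= threshold:
                flagged.append((t, e, score))
    return flagged
```

The pairwise loop is quadratic, which is tolerable for a 5K training subset but would need hashing or an index (for example MinHash) to scan the full 115K pool against large benchmarks.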

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks


The paper reports balanced accuracy figures obtained by training a policy on a curated subset and evaluating on 11 separate claim-verification benchmarks. No equations, derivations, or self-citations are invoked to obtain the reported numbers; the accuracies are measured quantities on held-out test data rather than quantities that reduce to the training inputs by construction. The data-curation funnel is a preprocessing step whose details do not appear in any load-bearing derivation that equates the final performance to the curation criteria themselves.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of RL policy optimization and the effectiveness of the described curation and reward design; no new physical entities or ad-hoc constants are introduced beyond typical ML hyperparameters.

free parameters (1)

reward ensemble weights
The multi-faceted reward likely requires weights or scaling factors chosen or tuned during development to balance usefulness, informativeness, and diversity.
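To make the role of such weights concrete, here is a minimal sketch of a weighted reward combination. The component names follow the Figure 3 caption (format, verification, diversity, coverage), but the weight values and the additive form are assumptions for illustration only; the Figure 14 caption refers to a joint-multiplicative reward, so the paper's actual aggregation may differ.

```python
# Hypothetical weights; the paper's actual values and aggregation rule are not given on this page.
WEIGHTS = {"format": 0.1, "verification": 0.5, "diversity": 0.2, "coverage": 0.2}


def combine_rewards(scores, weights=WEIGHTS):
    """Weighted average of per-component rewards, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# One hypothetical rollout: well-formed and correct, but with redundant questions.
reward = combine_rewards(
    {"format": 1.0, "verification": 1.0, "diversity": 0.3, "coverage": 0.8}
)
```

Each weight is a free parameter in the ledger's sense: changing it shifts which behaviour the policy is pushed toward, which is why the choice would need to be reported or ablated.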

axioms (2)

domain assumption: GRPO can be applied to train a policy for generating useful decomposition questions in claim verification
Invoked when framing decomposition as an RL policy trained with GRPO.
domain assumption: The curation funnel preserves representative learning signals across domains
Required for the claim that 5K curated claims suffice for the reported in- and out-of-domain performance.

pith-pipeline@v0.9.1-grok · 5764 in / 1408 out tokens · 49776 ms · 2026-06-29T13:36:50.335589+00:00 · methodology



[1] [1]

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal

Association for Computational Linguistics. Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. InProceedings of the Fourth Workshop on Fact Extraction ...

2021

[2] [2]

InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Fool me twice: Entailment from Wikipedia gamification. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Leo Gao, John Schulman, and Jacob Hilton. 2023. Scal- ing laws for reward model overoptimization. InIn- ternational Conference on Machine Learning, ICML 202...

2021

[3] [3]

Reinforcement Learning via Self-Distillation

Deepseek-r1: Incentivizing reasoning capabil- ity in llms via reinforcement learning. Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reason- ing steps. InProceedings of the 28th International Conference on Computational Linguistics. Matthew Honnibal, Ines Montani, ...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

InThe Twelfth In- ternational Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024

Let’s verify step by step. InThe Twelfth In- ternational Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. 10 Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human align- ment. InProceedings of the 2023 Conference on Empirical Methods in Na...

2024

[5] [5]

InProceedings of the Twenty-Ninth AAAI Conference on Artificial Intelli- gence, January 25-30, 2015, Austin, Texas, USA

Lazier than lazy greedy. InProceedings of the Twenty-Ninth AAAI Conference on Artificial Intelli- gence, January 25-30, 2015, Austin, Texas, USA. George L. Nemhauser, Laurence A. Wolsey, and Mar- shall L. Fisher. 1978. An analysis of approximations for maximizing submodular set functions—I.Mathe- matical Programming, (1). Mark Neumann, Daniel King, Iz Bel...

2015

[6] [6]

Training language models to follow instruc- tions with human feedback. InAdvances in Neural Information Processing Systems 35: Annual Confer- ence on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Liangming Pan, Xinyuan Lu, Min-Yen Kan, and Preslav Nakov. 2023a. QACheck: A demonstration syst...

2022

[7] [7]

Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Sys- tems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Michael Schlichtkrull, Zhijiang Guo, and Andreas Vla- chos. 2023. Averitec: A dataset for real...

2023

[8] [8]

Long-form factuality in large language models. InAdvances in Neural Information Processing Sys- tems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Kai Wei, Rishabh K. Iyer, and Jeff A. Bilmes. 2015. Submodularity in data subset selection and active learning. InProceedings ...

2024

[9] [9]

In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Submodular subset selection for large-scale speech training data. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for ...

2020

[10] [10]

Closer to our task, Chen et al

formalizes atomic checking for long-form factuality. Closer to our task, Chen et al. (2024) (henceforth Chen-2024) embed a learned claim-decomposer inside an end-to-end fact-checking pipeline with retrieval and claim-focused summarization, training the decomposer on existing gold decompositions. A parallel prompting line: chain-of-thought (Wei et al., ...

2024

[11] [11]

lucky-guess

that has motivated a parallel line of work on process reward models: PRM800K-style human-annotated step labels (Lightman et al., 2024), automatic process-reward construction from rollout outcomes (Setlur et al., 2025), and self-consistency-derived rewards (Wang et al., 2023). Reward models more broadly are also known to be vulnerable to over-optimiz...

2024

[12] [12]

breadth” axis of decomposition quality, and their OOD drops are correspondingly larger than those of Joint Quality (−2.0) and Question Count (−1.7), which operate on “depth

targets long-evidence multi-hop verification, and LLM-AggreFact (Tang et al., 2024) aggregates factuality judgments across heterogeneous generators. Neither corpus contributes to training, and both are used only to probe generalization beyond the training distribution. F Result Plots We complement the per-dataset numeric tables in §4.1 with two bar-char...

2024

[13] [13]

Is Get Low a song by Lil Jon & the East Side Boyz (featuring the Ying Yang Twins)?

[14] [14]

Does the document state that Get Low peaked at number two on the Billboard Hot 100?

[15] [15]

top ranked

Does the document mention any Lil Jon song that achieved a Billboard Hot 100 peak higher than number two? Figure 10: A representative training claim. The silver decomposition (§2.5) isolates two factual checks (existence, Hot 100 peak) and a comparative check that pins down the “top ranked” qualifier. partially-unsupported claim, and a counting-style fa...

1903

[16] [16]

Identify explicit connectives (and, or, but, because, which, etc.) and implicit assumptions, comparisons, or vague terms that each need separate verification

[17] [17]

Classify each sub-claim by type (e.g., entity, relational, quantitative, causal, temporal, comparative, etc.)

[18] [18]

## Iterative Question-Answer Cycle After your initial analysis, enter an iterative cycle where you:

- Note which sub-claims are independently falsifiable -- if any single one is refuted, the entire claim is refuted
- Write out a numbered checklist of these sub-claims (this list will guide your verification cycle)
- Identify any ambiguous, vague, or underspecified elements in the claim
- Determine what specific question you should ask

It's OK for this sect...

[19] [19]

**Ask a Question**: In <question> tags, pose a single specific verification question that addresses one aspect of the claim. Your question should target:
- A specific atomic sub-claim that needs verification
- An ambiguous element that needs clarification
- An underspecified term or concept
- Any other information needed to determine the claim's accuracy

[20] [20]

I don't know

**Answer the Question**: In <answer> tags, answer your question using **only** the evidence document:
- Search the evidence document for relevant information. If you find relevant passages, quote them directly.
- If the evidence document contains sufficient information, use it to answer the question and cite the relevant passage.
- If the evidence documen...

[21] [21]

**Evaluate Sufficiency**: In <think> tags, reason about whether you now have sufficient information to verify the claim. Consider:
- List which sub-claims have been verified so far and which remain unverified
- Are there remaining ambiguous or underspecified elements in the claim?
- Do you need additional information to make a confident verification jud...

[22] [22]

I don't know

**Repeat or Conclude**:
- If more information is needed, return to step 1 and ask another question.
- If you have sufficient information, proceed to final verification.

Continue the cycle until every sub-claim identified in your initial analysis has been addressed. Once all sub-claims are covered, proceed to final verification. Do not ask redundant questi...
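The question-answer cycle above emits tagged turns, which makes a trace easy to inspect programmatically. The following is a minimal sketch of one way to recover the turns from a generated trace. It assumes only the `<question>`, `<answer>`, and `<think>` tags named in the prompt; the `Turn` class and `parse_trace` function are hypothetical names introduced for illustration and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    """One question-answer step of the cycle (hypothetical container)."""
    question: str
    answer: str
    reflection: str  # text of the optional <think> block, "" if absent


# A turn is a <question> block, then an <answer> block, then an optional
# <think> block holding the sufficiency check.
TURN_RE = re.compile(
    r"<question>(.*?)</question>\s*"
    r"<answer>(.*?)</answer>\s*"
    r"(?:<think>(.*?)</think>)?",
    re.DOTALL,
)


def parse_trace(text: str) -> list[Turn]:
    """Extract the ordered question-answer turns from a model trace."""
    return [
        Turn(q.strip(), a.strip(), t.strip())
        for q, a, t in TURN_RE.findall(text)
    ]


trace = (
    "<question>Is Get Low a song by Lil Jon?</question>"
    "<answer>Yes, the document says so.</answer>"
    "<think>One sub-claim remains unverified.</think>\n"
    "<question>Did Get Low peak at number two?</question>\n"
    "<answer>I don't know</answer>"
)
turns = parse_trace(trace)
```

Keeping the turns as structured records, rather than raw text, is what lets each question be scored or audited separately.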

[23] [23]

**Atomic sub-claims**: Break down the main claim into its fundamental, indivisible components that each require verification

[24] [24]

Partially answerable

**Under-specified elements**: Identify vague or ambiguous parts of the claim that need clarification to enable proper verification

Guidelines for your analysis:
- Generate between 1 and 20 questions
- Aim for the smallest possible set that still ensures complete verification
- Avoid redundant questions that provide diminishing returns
- Each question...

[25] [25]

**Is a question**: Does the text contain an actual question rather than being purely a statement, analysis, or explanation? A brief setup before the question is acceptable, but the text must contain an actual question.

[26] [26]

**Single-focus**: Does the question ask about exactly one thing? A question fails this if it asks about multiple distinct aspects, facts, or relationships in a single question.

[27] [27]

and", "or

**No conjunctions**: Does the question avoid using "and", "or", "as well as", or similar conjunctions to join distinct sub-claims or topics? Minor conjunctions within a single concept (e.g., "cause and effect") are acceptable.

[28] [28]

**Verifiable**: Does the question have a definitive yes/no or specific factual answer? It should not be open-ended, subjective, or require an essay-length response.

[29] [29]

**Grounded**: Does the question reference a specific entity, fact, number, or detail from the claim rather than being generic or abstract?

First, briefly reason about each criterion. Then provide your final answers inside <answer> tags in the exact format: <answer> is_question:YES/NO single_focus:YES/NO no_conjunctions:YES/NO verifiable:YES/NO grounded:...

[30] [30]

List each sub-claim in the claim

[31] [31]

Determine if each sub-claim is supported/refuted/unknown based on the answers

[32] [32]

Then provide your final verdict inside <verdict> tags containing only one of: Supported, Refuted, or Not Enough Information.

Aggregate to final verdict

First, briefly explain your reasoning by analyzing how each answer relates to the claim. Then provide your final verdict inside <verdict> tags containing only one of: Supported, Refuted, or Not Enough Information.