Mask-to-Correct$^+$: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction

Lavisha Sharma; Madhusudan Ghosh; Partha Basuchowdhuri; Payel Santra

arxiv: 2605.18776 · v1 · pith:QBJYTLBTnew · submitted 2026-04-21 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-21 00:59 UTC · model grok-4.3

keywords: fact correction · retrieval augmented generation · diversity-aware masking · retriever ensemble · semantic faithfulness · training-free · misinformation

The pith

Diversity-aware masking and retriever ensembles enable training-free fact correction with up to 14 percent SARI gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mask-to-Correct, a retrieval-augmented generation framework that applies diversity-aware masking to locate erroneous spans in claims and then assesses correction faithfulness against retrieved evidence. This design eliminates the need for supervised training on annotated claim-evidence pairs, which are scarce and often biased. The extended Mask-to-Correct+ version further combines corrections from multiple retrievers to lessen dependence on any single ranking system. Experiments on benchmark datasets show consistent improvements over prior methods, reaching gains of up to 14 percent in SARI scores while using no gold evidence.

Core claim

Mask-to-Correct identifies erroneous spans through diversity-aware masking and evaluates the faithfulness of generated corrections using retrieved evidence in a RAG pipeline. Mask-to-Correct+ extends the approach by ensembling corrections across multiple rankers to reduce retrieval bias, delivering up to 14 percent SARI improvement over baselines on standard datasets without supervised training or gold evidence.

What carries the argument

Diversity-aware masking to isolate erroneous claim spans together with an ensemble of multiple retrievers to mitigate retrieval bias in training-free fact correction.
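
As a rough illustration of how similarity and diversity can be traded off when choosing spans to mask, the sketch below uses a greedy maximal-marginal-relevance (MMR) style rule. This is a stand-in technique and not the paper's masker: the function name `select_spans`, the `trade_off` weight, and the toy two-dimensional vectors are all assumptions made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_spans(claim_vec, span_vecs, k, trade_off=0.5):
    """Greedily pick k spans that are relevant to the claim yet mutually diverse.

    Hypothetical MMR-style rule (an assumption, not the paper's algorithm):
    score = trade_off * sim(span, claim) - (1 - trade_off) * max sim(span, picked)
    """
    selected = []
    candidates = list(span_vecs)
    while candidates and len(selected) < k:
        def mmr(name):
            relevance = cosine(claim_vec, span_vecs[name])
            # Redundancy is the highest similarity to any span already picked.
            redundancy = max(
                (cosine(span_vecs[name], span_vecs[s]) for s in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings, made up purely for illustration.
claim_vec = [1.0, 0.5]
span_vecs = {
    "top": [1.0, 0.45],
    "near_duplicate": [1.0, 0.2],
    "diverse": [0.3, 1.0],
}
picked = select_spans(claim_vec, span_vecs, k=2)
```

With these toy vectors the first pick is the span closest to the claim, and the second pick is `diverse` rather than `near_duplicate`, because the redundancy penalty outweighs the small relevance advantage of the near duplicate.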

If this is right

Corrections achieve higher semantic faithfulness by direct comparison to retrieved evidence rather than learned patterns.
Performance generalizes across domains because the method requires no domain-specific labeled training pairs.
Retrieval bias decreases when corrections from several rankers are combined, producing more stable results across queries.
The framework operates entirely at inference time, avoiding the data-collection costs of supervised alternatives.
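
One simple way to picture combining corrections from several rankers is to sum normalized scores per distinct correction and keep the highest total. The sketch below assumes each retriever contributes a list of `(correction, score)` pairs; the retriever names, the scores, and the aggregation rule itself are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ensemble_corrections(per_retriever):
    """Return the correction with the highest summed, per-retriever-normalized score.

    per_retriever maps a retriever name to a list of (correction, score) pairs.
    Scores are normalized within each retriever so that no single ranker
    dominates just because its score scale is larger (an assumed design choice).
    """
    totals = defaultdict(float)
    for candidates in per_retriever.values():
        norm = sum(score for _, score in candidates)
        if norm <= 0:
            continue  # skip retrievers with no usable scores
        for correction, score in candidates:
            key = correction.strip().lower()  # merge trivially different strings
            totals[key] += score / norm
    if not totals:
        raise ValueError("no scored corrections to combine")
    return max(totals, key=totals.get)


# Hypothetical candidates and faithfulness scores from three rankers.
per_retriever = {
    "bm25": [("The Eiffel Tower is in Paris", 0.6), ("The Eiffel Tower is in Lyon", 0.4)],
    "monot5": [("The Eiffel Tower is in Paris", 0.5), ("The Eiffel Tower is in Rome", 0.5)],
    "dense": [("The Eiffel Tower is in Lyon", 0.7), ("the Eiffel Tower is in Paris ", 0.3)],
}
best = ensemble_corrections(per_retriever)
```

In this toy case no single retriever is decisive: the dense ranker prefers the Lyon variant, yet the Paris correction wins once the three score distributions are summed.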

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ensemble strategy may apply to other retrieval-augmented tasks where single-retriever variance limits reliability.
Masking-based span detection could be tested on claims containing numerical or temporal errors to probe its detection limits.

Load-bearing premise

The premise is twofold: diversity-aware masking must reliably identify erroneous spans in claims, and combining corrections from multiple rankers must reduce retrieval bias enough to improve faithfulness and generalization without any supervised training or gold evidence.

What would settle it

Measuring SARI scores on a new benchmark dataset where masking frequently misses subtle factual errors would show whether the claimed reliability of error-span detection holds.
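
For readers unfamiliar with the metric, SARI averages an F1 score over added tokens, an F1 score over kept tokens, and a precision score over deleted tokens, each judged against the source and a reference. The sketch below is a deliberately simplified variant (unigrams only, token sets instead of counts, a single reference) and the empty-set conventions are assumptions; it is meant to convey the shape of the computation and will not reproduce official SARI numbers.

```python
def _precision(pred, gold):
    # Assumed convention: an empty prediction is perfect only if nothing was expected.
    if not pred:
        return 1.0 if not gold else 0.0
    return len(pred & gold) / len(pred)


def _recall(pred, gold):
    if not gold:
        return 1.0 if not pred else 0.0
    return len(pred & gold) / len(gold)


def _f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def simple_sari(source, output, reference):
    """Simplified unigram, set-based, single-reference SARI in [0, 1]."""
    s, o, r = (set(text.lower().split()) for text in (source, output, reference))
    add_pred, add_gold = o - s, r - s      # tokens introduced
    keep_pred, keep_gold = o & s, r & s    # tokens retained from the source
    del_pred, del_gold = s - o, s - r      # tokens dropped from the source
    f_add = _f1(_precision(add_pred, add_gold), _recall(add_pred, add_gold))
    f_keep = _f1(_precision(keep_pred, keep_gold), _recall(keep_pred, keep_gold))
    p_del = _precision(del_pred, del_gold)
    return (f_add + f_keep + p_del) / 3


source = "Paris is the capital of Germany"
reference = "Paris is the capital of France"
perfect = simple_sari(source, reference, reference)
unchanged = simple_sari(source, source, reference)
```

A correction that matches the reference scores 1.0 here, while leaving the claim unchanged earns credit only on the keep component, which is why an uncorrected claim scores far lower.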

Figures

Figures reproduced from arXiv: 2605.18776 by Lavisha Sharma, Madhusudan Ghosh, Partha Basuchowdhuri, Payel Santra.

**Figure 1.** An illustrating example of our proposed model. Given an incorrect claim, each retriever yields a different set of evidence, leading to diverse corrections. Our approach encompasses these corrections to marginalize out a factually correct version of the claim. Here, ‘GT’ denotes the groundtruth claim.

**Figure 2.** Schematic overview of our proposed M$_2$C$^+$. Given a claim, the diversity-aware masker (an iterative module) identifies and prioritizes spans that are likely to be incorrect. It selects entities based on similarity and diversity (e.g., selecting the most similar entity in the 1st iteration (from the same cluster); in the 2nd iteration, it adds an entity that remains relevant to the claim but is diverse from t…

**Figure 3.** Sensitivity of M2C variants using the Qwen model. (a–b) Sensitivity to the number of in-context examples on FEVER and SciFact; M2C-Gold denotes use of gold-standard evidence. (c) Effect of different correction scoring combinations on M2CDM–MonoT5 for FEVER. (d) Sensitivity of token masking ratio of M2CRM–MonoT5 on FEVER. (e) Correlation between retrieval quality (nDCG@10) and downstream performance (SARI(%…

**Figure 4.** An illustration of the prompt structure used in the few-shot RAG experiment.

**Figure 5.** An illustration of the prompt structure used in our proposed approach M

Original abstract

The rapid spread of misinformation on social media highlights the need for robust, automated fact correction frameworks. However, existing works rely on supervised learning from manually annotated claim-evidence pairs, which are scarce and prone to biases, limiting their generalization across domains. Moreover, these methods overlook semantic faithfulness in their correction process. To address these challenges, we propose Mask-to-Correct (M$_2$C), a training-free, inference-only Retrieval Augmented Generation (RAG) based framework that leverages diversity-aware masking to identify erroneous spans of claims and evaluate the faithfulness of corrections using retrieved evidence. However, the effectiveness of RAG heavily depends on the choice of retriever, which may vary across queries. To mitigate this, we further introduce M$_2$C$^+$, an ensemble-based framework that combines corrections across multiple rankers to reduce retrieval bias and improve robustness. Extensive experiments on the benchmark datasets demonstrate that our proposed frameworks consistently outperform all baselines, achieving up to 14% improvement in SARI scores, without using gold evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The training-free masking plus retriever ensemble for fact correction is a sensible direction, but the lack of any check on whether the masks actually catch factual errors rather than retrieval artifacts leaves the main results hard to trust.

The punchline here is that Mask-to-Correct and its plus version offer a training-free RAG method for fact correction using masking to target errors and ensembles to cut retriever bias, but the approach rests on an unverified assumption about what the masking actually detects.

They do something useful by avoiding the need for scarce annotated claim-evidence pairs and instead leaning on existing retrievers and benchmarks. The ensemble step to combine corrections from multiple rankers is a reasonable way to address variability in retrieval quality across queries. This could help with generalization across domains, which is a real problem in misinformation work.

Where it gets soft is the masking mechanism. The idea is to use diversity-aware masking to identify erroneous spans, but there's no reported test of how accurate those masks are against human-labeled errors. If the masks are picking up on query-retriever mismatches or lexical issues instead of factual inaccuracies, then the corrections are fixing the wrong things and any faithfulness or SARI improvements are not as meaningful. The abstract mentions consistent outperformance and up to 14% SARI gains, yet without baselines, error analysis, or those intermediate precision numbers, it's difficult to judge the strength of the results.

This kind of work would appeal to folks in information retrieval focused on fact checking and robust RAG systems. A reader looking for practical, annotation-light methods might pick up some techniques here. I'd send it to peer review. The core idea has merit for the field, but the authors need to strengthen the evidence around the masking validation and provide more experimental transparency.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Mask-to-Correct (M₂C), a training-free RAG framework that uses diversity-aware masking to identify erroneous spans in claims and evaluate correction faithfulness with retrieved evidence. It extends this to M₂C⁺, an ensemble method combining outputs from multiple rankers to reduce retrieval bias. The authors report that both frameworks outperform baselines on benchmark datasets with up to 14% SARI gains, without gold evidence or supervised training.

Significance. If the masking step reliably isolates factual errors and the ensemble demonstrably reduces bias, the work would offer a meaningful contribution to unsupervised fact correction in information retrieval. The training-free design and emphasis on retriever diversity address data scarcity and robustness issues in misinformation correction, potentially improving generalization across domains.

major comments (1)

[Abstract and Experiments] Abstract and Experiments section: the central claim of faithful corrections and up to 14% SARI gains rests on the assumption that diversity-aware masking correctly flags factual errors rather than retrieval artifacts or lexical mismatches. No precision/recall or other intermediate metrics are reported for the masked spans against human-annotated error locations, leaving this load-bearing step unvalidated and making downstream faithfulness improvements difficult to interpret.
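
The intermediate metric requested here could be computed at the token level once error-span annotations exist. The sketch below assumes spans are given as half-open `(start, end)` token offsets; the function name `span_prf` and the example offsets are hypothetical, and no such annotations are reported for the benchmarks discussed.

```python
def span_prf(predicted, gold):
    """Token-level precision, recall, and F1 of masked spans against annotated error spans.

    Both arguments are lists of half-open (start, end) token offsets.
    """
    pred_tokens = {i for start, end in predicted for i in range(start, end)}
    gold_tokens = {i for start, end in gold for i in range(start, end)}
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: the masker flags tokens 0 and 5, but only token 5 is a true error.
precision, recall, f1 = span_prf(predicted=[(0, 1), (5, 6)], gold=[(5, 6)])
```

In this example recall is perfect while precision is one half, which is the pattern one would expect if a masker over-flags spans that are lexically unusual but factually correct.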

minor comments (2)

[Title and Abstract] Standardize notation for the proposed method (M₂C vs. M2C) throughout the title, abstract, and body for consistency.
[Abstract] The abstract states 'extensive experiments' and 'consistently outperform all baselines' but does not name the specific datasets, baselines, or statistical tests supporting the 14% SARI figure; these details should be added for clarity even if present later in the paper.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the detailed and constructive review. The feedback on validating the masking step is well-taken, and we address it directly below while preserving the core contributions of our training-free approach.

Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of faithful corrections and up to 14% SARI gains rests on the assumption that diversity-aware masking correctly flags factual errors rather than retrieval artifacts or lexical mismatches. No precision/recall or other intermediate metrics are reported for the masked spans against human-annotated error locations, leaving this load-bearing step unvalidated and making downstream faithfulness improvements difficult to interpret.

Authors: We appreciate the referee's emphasis on this point. The diversity-aware masking in M₂C is an unsupervised heuristic that identifies candidate erroneous spans by measuring inconsistency in retrieval scores across multiple retrievers; it does not rely on supervised span labels. Standard fact-correction benchmarks (such as those used in our experiments) provide claim-evidence-correction triples but do not include human-annotated error span locations. Consequently, precision/recall against gold error spans cannot be computed without creating new annotations, which would fall outside the scope of a training-free method. Our evaluation instead uses the established SARI metric, which directly quantifies the faithfulness and overlap of the final correction with respect to the retrieved evidence. To strengthen interpretability, we will add a qualitative analysis of selected masked spans together with a limitations paragraph discussing the absence of span-level ground truth in the revised manuscript.

revision: partial

standing simulated objections not resolved

Computation of precision/recall for masked spans against human-annotated error locations, because the benchmark datasets do not contain such annotations.

Circularity Check

0 steps flagged

No circularity: framework uses external retrievers, benchmarks, and standard metrics

The paper introduces Mask-to-Correct (M₂C) and M₂C⁺ as training-free RAG frameworks that apply diversity-aware masking to claims using retriever outputs, then generate corrections and evaluate them with SARI on benchmark datasets. No equations, parameters, or claims reduce to self-definition or fitted inputs by construction; the masking step operates on retriever diversity independently of the downstream SARI scores, and the ensemble across rankers is a standard robustness technique rather than a self-referential loop. The reported gains rely on external benchmarks and retrievers without load-bearing self-citations or ansatzes smuggled from prior author work. This is a self-contained empirical proposal whose validity can be checked against the cited datasets and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all components appear drawn from standard RAG and masking techniques.

