Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Pratuat Amatya; Vinay Setty

arxiv: 2606.08605 · v1 · pith:AU6W3OU2new · submitted 2026-06-07 · 💻 cs.CL

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Pratuat Amatya , Vinay Setty This is my paper

Pith reviewed 2026-06-27 18:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual fact-checkingfine-tuned modelsclaim detectionveracity predictionencoder modelsLLM comparisonproduction deploymentefficiency

0 comments

The pith

Fine-tuned compact models deliver stable multilingual fact-checking across 114 languages while matching LLM accuracy with far lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a production fact-checking pipeline with three stages: claim detection, evidence retrieval with re-ranking, and veracity prediction. It fine-tunes XLM-RoBERTa-Large for detection, mmBERT-base for stance classification, and a SetFit re-ranker, then measures them against GPT-5.2, Claude Opus 4.6, and Qwen3-8b on real production data. Results show the compact models maintain strong, stable performance across many languages and cut latency substantially on identical hardware. The work argues these self-hosted encoders form a practical base for high-volume, privacy-sensitive deployments.

Core claim

Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints.

What carries the argument

The modular three-stage pipeline of claim detection, evidence retrieval and re-ranking, and veracity prediction, implemented with task-specific fine-tuned encoder models.

If this is right

Task-specific fine-tuning yields more stable results than general-purpose LLMs when languages vary widely.
Encoder models can handle high-throughput claim detection without the cost of calling large generative models.
Self-hosted fine-tuned components preserve data privacy while remaining competitive on retrieval quality.
The same modular design can support ongoing updates to individual stages without retraining the entire system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fine-tuning patterns may transfer to other high-volume multilingual text classification problems such as sentiment or topic detection.
The efficiency advantage could allow fact-checking services to operate in regions with limited compute resources.
If the production data distribution shifts over time, periodic re-tuning of the compact models would be needed to maintain the reported stability.

Load-bearing premise

The production data used for evaluation is representative of real-world multilingual fact-checking challenges.

What would settle it

A new test set of claims or languages outside the current production data where the fine-tuned models fall below the LLM baselines in either accuracy or measured latency.

Figures

Figures reproduced from arXiv: 2606.08605 by Pratuat Amatya, Vinay Setty.

**Figure 1.** Figure 1: The deployed Factiverse fact-checking pipeline. Blue nodes are [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Evaluation of claim detection for 114 languages using Factiverse model (fine-tuned XLM-RoBERTa-Large), GPT-5.2, Claude Opus 4.6 and Qwen3-8b. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 4.** Figure 4: Stance-classification (veracity prediction) prompt used for the LLM [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation of veracity prediction for 28 languages using mmBERT-base fine-tuned (Factiverse), GPT-5.2, Claude Opus 4.6 and Qwen3-8b (some [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim--evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus~4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuned compact models deliver practical efficiency wins over LLMs on Factiverse production data across many languages, but the results stay tied to internal logs without public benchmark anchors.

read the letter

The one thing to take away is that this paper shows task-specific fine-tuned models holding their own against current LLMs for claim detection and veracity prediction on real production traffic, while cutting latency substantially on identical hardware.

They fine-tune XLM-RoBERTa-Large for claim detection over 114 languages, mmBERT for three-way stance, and a SetFit re-ranker, then run direct comparisons to GPT-5.2, Claude Opus 4.6, and Qwen3-8b. The same-hardware latency figures and the decision to release code plus data on GitHub are the parts that actually move the needle for anyone trying to ship this kind of system without API spend or data leaving the building.

The work is straightforward about the pipeline and the efficiency trade-offs, which is useful when the goal is high-throughput multilingual operation under cost and privacy limits. The language counts are large enough to be worth noticing.

The soft spot is exactly the one the stress-test flags: everything is measured on Factiverse production data. The abstract gives no per-language breakdowns, no language-balance stats, and no numbers on public sets like X-Fact or Multi-FEVER. Production logs tend to favor cases the existing system already handles, so the claim of “strong and stable” performance lacks an external check. That gap is real and limits how far the results travel.

This is for practitioners who need to decide between fine-tuned encoders and LLM APIs for fact-checking pipelines. A reader building or evaluating deployed systems will find the latency data and model choices worth examining. It is solid enough applied work to send to peer review; the experiments are on actual traffic and the comparisons are direct, even if adding public-benchmark results would make the generalization case stronger.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a modular multilingual fact-checking pipeline deployed at Factiverse, with three stages: claim detection via fine-tuned XLM-RoBERTa-Large, evidence retrieval and re-ranking via a SetFit-based model, and veracity prediction via mmBERT-base for three-label stance classification. It evaluates these compact models against LLM baselines (GPT-5.2, Claude Opus 4.6, Qwen3-8b) on Factiverse production data spanning 114 languages for claim detection and 28 languages for veracity prediction, claiming that task-specific fine-tuning yields strong and stable multilingual performance while remaining competitive with proprietary embeddings and delivering large same-hardware latency gains for encoder components. Code and data are released on GitHub.

Significance. If the results hold, the work provides concrete evidence that fine-tuned compact encoder models can serve as a practical, efficient, and privacy-preserving foundation for high-throughput multilingual fact-checking at production scale, outperforming or matching larger LLMs in targeted tasks while reducing latency and cost. The emphasis on real-world deployment metrics and open release of resources strengthens its relevance for applied NLP systems.

major comments (3)

[Abstract] Abstract: The central claim that 'task-specific fine-tuning provides strong and stable multilingual performance' across 114 languages rests exclusively on internal production data; no per-language performance breakdowns, language-balance statistics, or head-to-head results on public benchmarks (e.g., X-Fact, Multi-FEVER) are referenced, leaving the stability and generalizability assertions without an external anchor.
[Abstract] Abstract (and experiments section): Production data are described only at aggregate level (114/28 languages); without details on data collection filters, labeling quality controls, or potential over-representation of high-resource languages or easily detectable claims, it is impossible to assess whether the data distribution matches real-world multilingual fact-checking challenges.
[Abstract] Abstract: The efficiency claim of 'large efficiency gains for encoder-based components' is stated without quantitative latency numbers, hardware specifications, or direct comparison tables against the LLM baselines under identical conditions, weakening the support for the production-deployment recommendation.

minor comments (2)

The abstract references 'Code and data used for this study are available at https://github.com/factiverse/factcheck-editor' but does not specify whether the repository includes exact data splits, evaluation scripts, and model checkpoints needed for full reproducibility.
[Abstract] Model names (XLM-RoBERTa-Large, mmBERT-base, SetFit) and LLM versions (GPT-5.2, Claude Opus 4.6) are given without citation to their original papers or release notes.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'task-specific fine-tuning provides strong and stable multilingual performance' across 114 languages rests exclusively on internal production data; no per-language performance breakdowns, language-balance statistics, or head-to-head results on public benchmarks (e.g., X-Fact, Multi-FEVER) are referenced, leaving the stability and generalizability assertions without an external anchor.

Authors: We agree that external validation would strengthen generalizability claims. While the core contribution centers on production-scale deployment, we will add a subsection with results on public benchmarks (X-Fact, Multi-FEVER) and include language-balance statistics plus per-language breakdowns for high-resource languages in the production data. These will be referenced in the abstract. revision: yes
Referee: [Abstract] Abstract (and experiments section): Production data are described only at aggregate level (114/28 languages); without details on data collection filters, labeling quality controls, or potential over-representation of high-resource languages or easily detectable claims, it is impossible to assess whether the data distribution matches real-world multilingual fact-checking challenges.

Authors: We will expand the data section with details on labeling quality controls (including agreement metrics) and language distribution analysis to address potential over-representation. Some collection filters are proprietary and cannot be disclosed in full, but we will provide maximum feasible transparency and discuss limitations. revision: partial
Referee: [Abstract] Abstract: The efficiency claim of 'large efficiency gains for encoder-based components' is stated without quantitative latency numbers, hardware specifications, or direct comparison tables against the LLM baselines under identical conditions, weakening the support for the production-deployment recommendation.

Authors: The experiments section already contains the quantitative latency results, hardware details (e.g., A100 GPUs), and comparison tables. We will revise the abstract to incorporate key numbers (e.g., specific ms-per-example figures) and explicitly reference the tables. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation on external production data and LLM baselines

full rationale

The paper describes an empirical pipeline for multilingual fact-checking using fine-tuned encoder models (XLM-RoBERTa, mmBERT, SetFit) evaluated on Factiverse production data across 114/28 languages, with direct comparisons to external LLM baselines (GPT-5.2, Claude Opus, Qwen3). No equations, derivations, or 'predictions' appear. No self-citations are invoked to justify uniqueness or load-bearing premises. Results are presented as measured performance and latency on held-out production logs, not as quantities forced by fitting or self-definition. This matches the default expectation of a non-circular experimental report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents detailed identification of free parameters or axioms; the approach relies on standard pre-trained language models and fine-tuning procedures from prior work.

pith-pipeline@v0.9.1-grok · 5747 in / 1189 out tokens · 31776 ms · 2026-06-27T18:57:15.135945+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Brenda: Browser extension for fake news detection,

B. Botnevik, E. Sakariassen, and V . Setty, “Brenda: Browser extension for fake news detection,” inProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20, 2020, p. 2117–2120

2020
[2]

Surprising efficacy of fine-tuned transformers for fact-checking over larger language models,

V . Setty, “Surprising efficacy of fine-tuned transformers for fact-checking over larger language models,” inProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 2842–2846

2024
[3]

Generative Large Language Models in Automated Fact-Checking: A Survey

I. Vykopal, M. Pikuliak, S. Ostermann, and M. ˇSimko, “Generative large language models in automated fact-checking: A survey,”arXiv preprint arXiv:2407.02351, 2024

work page internal anchor Pith review arXiv 2024
[4]

Rarr: Researching and revising what language models say, using language models,

L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y . Fan, V . Zhao, N. Lao, H. Lee, D.-C. Juanet al., “Rarr: Researching and revising what language models say, using language models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 16 477–16 508

2023
[5]

A survey on automated fact-checking,

Z. Guo, M. Schlichtkrull, and A. Vlachos, “A survey on automated fact-checking,”Transactions of the Association for Computational Lin- guistics, vol. 10, pp. 178–206, 2022

2022
[6]

Claimbuster: the first-ever end-to-end fact-checking system,

N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V . Sable, C. Li, and M. Tremayne, “Claimbuster: the first-ever end-to-end fact-checking system,”Proc. VLDB Endow., vol. 10, no. 12, pp. 1945–1948, 2017

1945
[7]

Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society,

F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. Da San Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al- Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, and P. Nakov, “Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society,” in...

2021
[8]

Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research,

R. Panchendrarajan and A. Zubiaga, “Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research,”Natural Language Processing Journal, vol. 7, p. 100066, 2024

2024
[9]

Factir: A real-world zero-shot open-domain retrieval benchmark for fact-checking,

V . V and V . Setty, “Factir: A real-world zero-shot open-domain retrieval benchmark for fact-checking,” inCompanion Proceedings of the ACM on Web Conference 2025, ser. WWW ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 809–812. [Online]. Available: https://doi.org/10.1145/3701716.3715300

work page doi:10.1145/3701716.3715300 2025
[10]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’19’, K. Inui, J. Jiang, V . Ng, and X. Wan, Eds. Association for Computational Linguistics, 2019, pp. 3982–3992

2019
[11]

Tunstall, N

L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, and O. Pereg, “Efficient few-shot learning without prompts,”arXiv preprint arXiv:2209.11055, 2022. [Online]. Available: https://arxiv.org/abs/2209.11055

work page arXiv 2022
[12]

Declare: Debunking fake news and false claims using evidence-aware deep learning,

K. Popat, S. Mukherjee, A. Yates, and G. Weikum, “Declare: Debunking fake news and false claims using evidence-aware deep learning,” in Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 22–32

2018
[13]

Averitec: A dataset for real-world claim verification with evidence from the web,

M. Schlichtkrull, Z. Guo, and A. Vlachos, “Averitec: A dataset for real-world claim verification with evidence from the web,”Advances in Neural Information Processing Systems, vol. 36, pp. 65 128–65 167, 2023

2023
[14]

Finding streams in knowledge graphs to support fact checking,

P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, “Finding streams in knowledge graphs to support fact checking,” in2017 IEEE International Conference on Data Mining (ICDM), 2017, pp. 859–864

2017
[15]

Higil: Hierarchical graph inference learning for fact checking,

Q. Mao, Y . Wang, C. Yang, L. Du, H. Peng, J. Wu, J. Li, and Z. Wang, “Higil: Hierarchical graph inference learning for fact checking,” in2022 IEEE International Conference on Data Mining (ICDM), 2022, pp. 337– 346

2022
[16]

Jointly embedding the local and global relations of heterogeneous graph for rumor detection,

C. Yuan, Q. Ma, W. Zhou, J. Han, and S. Hu, “Jointly embedding the local and global relations of heterogeneous graph for rumor detection,” in2019 IEEE International Conference on Data Mining (ICDM), 2019, pp. 796–805

2019
[17]

SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,

P. Manakul, A. Liusie, and M. Gales, “SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’23, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp. 9004–9017

2023
[18]

Fac- Tool: Factuality Detection in Generative AI – A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios,

I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, and P. Liu, “Factool: Factuality detection in generative ai – a tool augmented framework for multi-task and multi-domain scenarios,” arXiv preprint arXiv:2307.13528, 2023

work page arXiv 2023
[19]

Mishra, A

A. Mishra, A. Asai, V . Balachandran, Y . Wang, G. Neubig, Y . Tsvetkov, and H. Hajishirzi, “Fine-grained hallucination detection and editing for language models,”arXiv preprint arXiv:2401.06855, 2024

work page arXiv 2024
[20]

Factcheck-bench: Fine-grained evaluation benchmark for automatic fact-checkers,

Y . Wang, R. G. Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii, J. Geng, O. M. Afzal, L. Pan, N. Borenstein, A. Pillai, I. Au- genstein, I. Gurevych, and P. Nakov, “Factcheck-bench: Fine-grained evaluation benchmark for automatic fact-checkers,”arXiv preprint arXiv:2311.09000, 2024

work page arXiv 2024
[21]

A survey on natural language processing for fake news detection,

R. Oshikawa, J. Qian, and W. Y . Wang, “A survey on natural language processing for fake news detection,” inProceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 6086–6093

2020
[22]

The perils and promises of fact-checking with large language models,

D. Quelle and A. Bovet, “The perils and promises of fact-checking with large language models,”Frontiers in Artificial Intelligence, vol. 7, p. 1341697, 2024

2024
[23]

Livefc: A system for live fact-checking of audio streams,

V . Venktesh and V . Setty, “Livefc: A system for live fact-checking of audio streams,” in18th ACM International Conference on Web Search and Data Mining, WSDM 2025. Association for Computing Machinery (ACM), 2025, pp. 1060–1063

2025
[24]

Shortcheck: Checkworthiness detection of mul- tilingual short-form videos,

H. Vatndal and V . Setty, “Shortcheck: Checkworthiness detection of mul- tilingual short-form videos,” inThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia- Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, 2025, p. 77

2025
[25]

Factcheck editor: Multilingual text editor with end-to-end fact-checking,

V . Setty, “Factcheck editor: Multilingual text editor with end-to-end fact-checking,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2744–2748

2024
[26]

Unsu- pervised cross-lingual representation learning at scale,

A. Conneau, K. Khandelwal, N. Goyal, V . Chaudhary, G. Wenzek, F. Guzm´an, E. Grave, M. Ott, L. Zettlemoyer, and V . Stoyanov, “Unsu- pervised cross-lingual representation learning at scale,” inProceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, ser. ACL ’20, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. ...

2020
[27]

Where the truth lies: Explaining the credibility of emerging claims on the web and social media,

K. Popat, S. Mukherjee, J. Str ¨otgen, and G. Weikum, “Where the truth lies: Explaining the credibility of emerging claims on the web and social media,” inProceedings of the 26th International Conference on World Wide Web Companion, ser. WWW ’17 Companion, 2017, p. 1003–1012

2017
[28]

SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection,

R. Mishra and V . Setty, “SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection,” inProceedings of the 2019 ACM SIGIR International Conference on Theory of Informa- tion Retrieval, ser. ICTIR ’19. Association for Computing Machinery, 2019, p. 197–204

2019

[1] [1]

Brenda: Browser extension for fake news detection,

B. Botnevik, E. Sakariassen, and V . Setty, “Brenda: Browser extension for fake news detection,” inProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’20, 2020, p. 2117–2120

2020

[2] [2]

Surprising efficacy of fine-tuned transformers for fact-checking over larger language models,

V . Setty, “Surprising efficacy of fine-tuned transformers for fact-checking over larger language models,” inProceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 2842–2846

2024

[3] [3]

Generative Large Language Models in Automated Fact-Checking: A Survey

I. Vykopal, M. Pikuliak, S. Ostermann, and M. ˇSimko, “Generative large language models in automated fact-checking: A survey,”arXiv preprint arXiv:2407.02351, 2024

work page internal anchor Pith review arXiv 2024

[4] [4]

Rarr: Researching and revising what language models say, using language models,

L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y . Fan, V . Zhao, N. Lao, H. Lee, D.-C. Juanet al., “Rarr: Researching and revising what language models say, using language models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 16 477–16 508

2023

[5] [5]

A survey on automated fact-checking,

Z. Guo, M. Schlichtkrull, and A. Vlachos, “A survey on automated fact-checking,”Transactions of the Association for Computational Lin- guistics, vol. 10, pp. 178–206, 2022

2022

[6] [6]

Claimbuster: the first-ever end-to-end fact-checking system,

N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V . Sable, C. Li, and M. Tremayne, “Claimbuster: the first-ever end-to-end fact-checking system,”Proc. VLDB Endow., vol. 10, no. 12, pp. 1945–1948, 2017

1945

[7] [7]

Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society,

F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. Da San Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al- Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, and P. Nakov, “Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society,” in...

2021

[8] [8]

Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research,

R. Panchendrarajan and A. Zubiaga, “Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research,”Natural Language Processing Journal, vol. 7, p. 100066, 2024

2024

[9] [9]

Factir: A real-world zero-shot open-domain retrieval benchmark for fact-checking,

V . V and V . Setty, “Factir: A real-world zero-shot open-domain retrieval benchmark for fact-checking,” inCompanion Proceedings of the ACM on Web Conference 2025, ser. WWW ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 809–812. [Online]. Available: https://doi.org/10.1145/3701716.3715300

work page doi:10.1145/3701716.3715300 2025

[10] [10]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’19’, K. Inui, J. Jiang, V . Ng, and X. Wan, Eds. Association for Computational Linguistics, 2019, pp. 3982–3992

2019

[11] [11]

Tunstall, N

L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, and O. Pereg, “Efficient few-shot learning without prompts,”arXiv preprint arXiv:2209.11055, 2022. [Online]. Available: https://arxiv.org/abs/2209.11055

work page arXiv 2022

[12] [12]

Declare: Debunking fake news and false claims using evidence-aware deep learning,

K. Popat, S. Mukherjee, A. Yates, and G. Weikum, “Declare: Debunking fake news and false claims using evidence-aware deep learning,” in Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 22–32

2018

[13] [13]

Averitec: A dataset for real-world claim verification with evidence from the web,

M. Schlichtkrull, Z. Guo, and A. Vlachos, “Averitec: A dataset for real-world claim verification with evidence from the web,”Advances in Neural Information Processing Systems, vol. 36, pp. 65 128–65 167, 2023

2023

[14] [14]

Finding streams in knowledge graphs to support fact checking,

P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, “Finding streams in knowledge graphs to support fact checking,” in2017 IEEE International Conference on Data Mining (ICDM), 2017, pp. 859–864

2017

[15] [15]

Higil: Hierarchical graph inference learning for fact checking,

Q. Mao, Y . Wang, C. Yang, L. Du, H. Peng, J. Wu, J. Li, and Z. Wang, “Higil: Hierarchical graph inference learning for fact checking,” in2022 IEEE International Conference on Data Mining (ICDM), 2022, pp. 337– 346

2022

[16] [16]

Jointly embedding the local and global relations of heterogeneous graph for rumor detection,

C. Yuan, Q. Ma, W. Zhou, J. Han, and S. Hu, “Jointly embedding the local and global relations of heterogeneous graph for rumor detection,” in2019 IEEE International Conference on Data Mining (ICDM), 2019, pp. 796–805

2019

[17] [17]

SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,

P. Manakul, A. Liusie, and M. Gales, “SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’23, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp. 9004–9017

2023

[18] [18]

Fac- Tool: Factuality Detection in Generative AI – A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios,

I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, and P. Liu, “Factool: Factuality detection in generative ai – a tool augmented framework for multi-task and multi-domain scenarios,” arXiv preprint arXiv:2307.13528, 2023

work page arXiv 2023

[19] [19]

Mishra, A

A. Mishra, A. Asai, V . Balachandran, Y . Wang, G. Neubig, Y . Tsvetkov, and H. Hajishirzi, “Fine-grained hallucination detection and editing for language models,”arXiv preprint arXiv:2401.06855, 2024

work page arXiv 2024

[20] [20]

Factcheck-bench: Fine-grained evaluation benchmark for automatic fact-checkers,

Y . Wang, R. G. Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii, J. Geng, O. M. Afzal, L. Pan, N. Borenstein, A. Pillai, I. Au- genstein, I. Gurevych, and P. Nakov, “Factcheck-bench: Fine-grained evaluation benchmark for automatic fact-checkers,”arXiv preprint arXiv:2311.09000, 2024

work page arXiv 2024

[21] [21]

A survey on natural language processing for fake news detection,

R. Oshikawa, J. Qian, and W. Y . Wang, “A survey on natural language processing for fake news detection,” inProceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 6086–6093

2020

[22] [22]

The perils and promises of fact-checking with large language models,

D. Quelle and A. Bovet, “The perils and promises of fact-checking with large language models,”Frontiers in Artificial Intelligence, vol. 7, p. 1341697, 2024

2024

[23] [23]

Livefc: A system for live fact-checking of audio streams,

V . Venktesh and V . Setty, “Livefc: A system for live fact-checking of audio streams,” in18th ACM International Conference on Web Search and Data Mining, WSDM 2025. Association for Computing Machinery (ACM), 2025, pp. 1060–1063

2025

[24] [24]

Shortcheck: Checkworthiness detection of mul- tilingual short-form videos,

H. Vatndal and V . Setty, “Shortcheck: Checkworthiness detection of mul- tilingual short-form videos,” inThe 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia- Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, 2025, p. 77

2025

[25] [25]

Factcheck editor: Multilingual text editor with end-to-end fact-checking,

V . Setty, “Factcheck editor: Multilingual text editor with end-to-end fact-checking,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2744–2748

2024

[26] [26]

Unsu- pervised cross-lingual representation learning at scale,

A. Conneau, K. Khandelwal, N. Goyal, V . Chaudhary, G. Wenzek, F. Guzm´an, E. Grave, M. Ott, L. Zettlemoyer, and V . Stoyanov, “Unsu- pervised cross-lingual representation learning at scale,” inProceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, ser. ACL ’20, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds. ...

2020

[27] [27]

Where the truth lies: Explaining the credibility of emerging claims on the web and social media,

K. Popat, S. Mukherjee, J. Str ¨otgen, and G. Weikum, “Where the truth lies: Explaining the credibility of emerging claims on the web and social media,” inProceedings of the 26th International Conference on World Wide Web Companion, ser. WWW ’17 Companion, 2017, p. 1003–1012

2017

[28] [28]

SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection,

R. Mishra and V . Setty, “SADHAN: Hierarchical attention networks to learn latent aspect embeddings for fake news detection,” inProceedings of the 2019 ACM SIGIR International Conference on Theory of Informa- tion Retrieval, ser. ICTIR ’19. Association for Computing Machinery, 2019, p. 197–204

2019