pith. machine review for the scientific record.

arxiv: 2511.01101 · v2 · submitted 2025-11-02 · 💻 cs.CL

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords fact verification · time series · benchmark · temporal reasoning · numerical reasoning · large language models · claim verification

The pith

The TSVer benchmark shows that even leading AI models struggle to verify claims against time-series evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TSVer, a benchmark for evaluating how well fact verification systems handle claims that depend on temporal and numerical reasoning over time-series data. It assembles 304 real-world claims drawn from 41 fact-checking organizations and pairs them with a curated database of 400 time series spanning multiple domains. Each claim is annotated with the relevant time frames across the pertinent series, a verdict, and a justification that traces how the evidence produces the verdict. The authors report that even a state-of-the-art reasoning model such as Gemini-2.5-Pro reaches only 63.57 percent verdict accuracy and an Ev2R score of 47.36 on verdict justifications, indicating that time-series evidence remains difficult for existing approaches.

Core claim

TSVer demonstrates that fact verification against time-series evidence requires explicit handling of time frames, numerical comparisons, and trend analysis that current models do not perform reliably. The benchmark supplies real claims, structured time series, and multi-step annotations that record which portions of the evidence support each verdict, allowing direct measurement of both verdict correctness and justification quality.

What carries the argument

The TSVer benchmark dataset, consisting of real-world claims annotated with time frames, verdicts, and justifications derived from 400 time series.
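To make the three annotation layers concrete, here is a minimal Python sketch of what one benchmark record could look like. The class names, field names, verdict label, series identifier, and the example claim are all illustrative assumptions; this page does not describe TSVer's actual file format or label set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TimeFrame:
    """A span of one time series judged relevant to a claim (hypothetical schema)."""
    series_id: str  # hypothetical key into the time-series database
    start: date
    end: date

    def __post_init__(self) -> None:
        # Reject inverted intervals early so downstream slicing stays simple.
        if self.start > self.end:
            raise ValueError("time frame start must not be after its end")


@dataclass(frozen=True)
class AnnotatedClaim:
    """One claim with the three annotation layers the paper describes."""
    claim: str
    time_frames: tuple[TimeFrame, ...]  # one entry per pertinent series
    verdict: str  # label names are assumed, not taken from the paper
    justification: str


# Invented example for illustration; not a record from the dataset.
example = AnnotatedClaim(
    claim="Unemployment fell every year between 2015 and 2019.",
    time_frames=(
        TimeFrame("unemployment_rate", date(2015, 1, 1), date(2019, 12, 31)),
    ),
    verdict="supported",
    justification="The rate declines in each year of the annotated window.",
)
```

Keeping the time frames as explicit (series, start, end) triples is what would let a system cite exact periods instead of whole series.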

If this is right

  • Fact verification pipelines will need dedicated modules for aligning claims to specific time intervals and performing numerical operations across series.
  • Progress on TSVer should translate to improved handling of statistical claims in real news and public discourse.
  • The provided time-frame annotations enable development of models that can cite exact periods rather than whole series.
  • The reported inter-annotator agreement on verdicts (kappa=0.77) supports reliable comparison of future systems against the reported baselines.
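The first bullet, aligning a claim to a time interval and then running numerical operations on it, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's baseline: the helper names, the dictionary representation of a series, and the sample values are all invented here.

```python
from datetime import date


def slice_series(series, start, end):
    """Return (date, value) points with start <= date <= end, in date order."""
    return sorted((d, v) for d, v in series.items() if start <= d <= end)


def percent_change(points):
    """Percent change from the first to the last value in the window."""
    if not points:
        raise ValueError("empty window")
    first, last = points[0][1], points[-1][1]
    if first == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (last - first) / first * 100.0


def is_strictly_decreasing(points):
    """True if every value is lower than the one before it."""
    values = [v for _, v in points]
    return all(b < a for a, b in zip(values, values[1:]))


# Invented yearly values; not taken from the TSVer database.
series = {
    date(2014, 1, 1): 6.2,
    date(2015, 1, 1): 5.3,
    date(2016, 1, 1): 4.9,
    date(2017, 1, 1): 4.4,
    date(2018, 1, 1): 3.9,
}
window = slice_series(series, date(2015, 1, 1), date(2017, 12, 31))
trend_holds = is_strictly_decreasing(window)
change = percent_change(window)
```

A claim such as "the rate fell every year from 2015 to 2017" would then be checked against `trend_holds` for that window only, which is why the choice of time frame matters as much as the arithmetic.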

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended with additional series from emerging domains such as climate or financial indicators to test generalization.
  • Systems that succeed on TSVer may also improve at verifying claims that combine time series with other structured sources like tables or graphs.
  • The annotation process itself offers a template for creating similar resources focused on other forms of quantitative evidence.

Load-bearing premise

The 304 selected claims and 400 time series, together with their LLM-assisted annotations, accurately represent the range of real-world fact verification tasks that involve temporal and numerical evidence.

What would settle it

A model that achieves substantially more than 63.57 percent verdict accuracy and an Ev2R score well above 47.36 on the full TSVer test set, while using only the provided time-series database and without having seen the annotations during training, would indicate that the claimed difficulty has been overcome.
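The verdict half of that comparison is plain accuracy, sketched below in Python. The function name and the toy labels are assumptions made for illustration; only the 63.57 figure comes from the paper. The Ev2R justification score is not reproduced here because this page does not define it.

```python
def verdict_accuracy(gold, predicted):
    """Percentage of claims whose predicted verdict equals the gold verdict."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Verdict accuracy the paper reports for Gemini-2.5-Pro.
REPORTED_BASELINE_ACCURACY = 63.57

# Toy labels for illustration only; the label names are assumed.
gold = ["supported", "refuted", "refuted", "supported"]
predicted = ["supported", "refuted", "supported", "supported"]

accuracy = verdict_accuracy(gold, predicted)
beats_baseline = accuracy > REPORTED_BASELINE_ACCURACY
```

A four-item toy set says nothing about the benchmark itself; a meaningful comparison needs the full test set and the evaluation conditions stated above.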

Figures

Figures reproduced from arXiv: 2511.01101 by Andreas Vlachos, Marek Strong.

Figure 1: Example claim from TSVer. Our dataset includes real-world claims paired with historical time-series evidence. All claims are annotated with time ranges (blue boxes), verdicts, and justifications emphasizing numerical and temporal reasoning. … assessing claims that rely on external evidence (Fontana et al., 2025) or when evaluating claims requires deeper reasoning beyond surface-level textual cues (Choi and…
Figure 2: TSVer data collection pipeline.
Figure 3: Instructions given to participants at the beginning of the annotation session. These instructions were…
Figure 4: Detailed step-by-step tutorial explaining the annotation study
Figure 5: Annotation Interface for Phase 1
Figure 6: Annotation Interface for Phase 2
Figure 7: Top 20 countries by share of claims in the benchmark dataset. Bars indicate the percentage of claims…
Figure 8: Share of fact-checked claims by publishing organization (…
Original abstract

Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 304 real-world claims sourced from 41 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.77 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even the state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.57 accuracy score on verdicts and an Ev2R score of 47.36 on verdict justifications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TSVer, a benchmark dataset for fact verification that emphasizes temporal and numerical reasoning over time-series evidence. It comprises 304 real-world claims drawn from 41 fact-checking organizations paired with a curated collection of 400 time series spanning diverse domains. Each claim receives annotations for relevant time frames across the series, a verdict, and justifications derived via an LLM-assisted multi-step process that yields an inter-annotator agreement of kappa=0.77 on verdicts. Baseline experiments show that even strong reasoning models such as Gemini-2.5-Pro reach only 63.57% verdict accuracy and 47.36 on the Ev2R justification metric, underscoring the difficulty of the task.

Significance. If the annotations are shown to be faithful, TSVer would fill an important gap by supplying structured, real-world time-series evidence for fact verification evaluation, moving beyond synthetic claims or unstructured evidence in prior datasets. The curation of authentic claims and the reported performance shortfalls of current SOTA models provide a concrete, falsifiable testbed that could stimulate targeted advances in temporal reasoning. The use of real claims from multiple organizations and the multi-domain time-series collection are clear strengths.

major comments (2)
  1. [Annotation Process] Annotation Process section: The LLM-assisted pipeline for producing time-frame, verdict, and justification labels reports only an aggregate kappa=0.77 on verdicts. No per-annotator confusion matrix, no human re-annotation of a held-out subset, and no analysis of cases in which the LLM step changed the original fact-checker verdict are supplied. Because the benchmark's utility as a test of temporal/numerical reasoning rests on these 304 tuples faithfully reflecting how the time series support or refute each claim, this omission is load-bearing for the central contribution.
  2. [Dataset Curation] Dataset Curation section: The abstract and description provide limited detail on claim selection criteria, the time-series curation process, and potential selection biases introduced by the LLM-assisted steps. Without these specifics, it is difficult to assess whether the 304 claims and 400 series accurately capture real-world fact-verification scenarios involving temporal evidence.
minor comments (2)
  1. [Abstract] Abstract: The Ev2R justification metric is referenced without a definition or pointer to its formal description; a one-sentence gloss or section reference would improve immediate readability.
  2. [Experiments] Experiments section: Baseline model configurations, prompting strategies, and exact input formats for the time-series evidence should be tabulated for reproducibility.
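The agreement statistics requested in major comment 1 are cheap to compute once per-annotator labels are available. The Python sketch below shows a pairwise confusion matrix and a free-marginal multirater kappa in the style of Randolph (2005), which appears in the paper's reference list. This page does not confirm which kappa variant produced the reported 0.77, so treating it as free-marginal is an assumption, as are the function names, label names, and toy ratings.

```python
from collections import Counter


def confusion_matrix(labels_a, labels_b):
    """Count (annotator A label, annotator B label) pairs across items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    return Counter(zip(labels_a, labels_b))


def free_marginal_kappa(ratings, num_categories):
    """Free-marginal multirater kappa.

    ratings: one sequence of labels per item, same number of raters each.
    Chance agreement is fixed at 1 / num_categories.
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or num_categories < 2:
        raise ValueError("need at least two raters and two categories")
    total = 0.0
    for item in ratings:
        counts = Counter(item)
        # Fraction of rater pairs that agree on this item.
        total += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_observed = total / len(ratings)
    p_chance = 1.0 / num_categories
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy ratings from two hypothetical annotators over four claims.
ratings = [
    ("supported", "supported"),
    ("refuted", "refuted"),
    ("supported", "refuted"),
    ("mixed", "mixed"),
]
kappa = free_marginal_kappa(ratings, num_categories=3)
matrix = confusion_matrix([a for a, _ in ratings], [b for _, b in ratings])
```

The confusion matrix shows where disagreements concentrate (here, one supported/refuted split), which a single aggregate kappa hides.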

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of TSVer as a benchmark for temporal and numerical reasoning in fact verification. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Annotation Process] Annotation Process section: The LLM-assisted pipeline for producing time-frame, verdict, and justification labels reports only an aggregate kappa=0.77 on verdicts. No per-annotator confusion matrix, no human re-annotation of a held-out subset, and no analysis of cases in which the LLM step changed the original fact-checker verdict are supplied. Because the benchmark's utility as a test of temporal/numerical reasoning rests on these 304 tuples faithfully reflecting how the time series support or refute each claim, this omission is load-bearing for the central contribution.

    Authors: We agree that the current description of the annotation process is insufficiently detailed. In the revised manuscript we will add a per-annotator confusion matrix for verdicts, describe the human re-annotation performed on a held-out subset, and include an analysis of cases where the LLM-assisted step produced a verdict different from the original fact-checker label. These additions will directly address concerns about annotation faithfulness. revision: yes

  2. Referee: [Dataset Curation] Dataset Curation section: The abstract and description provide limited detail on claim selection criteria, the time-series curation process, and potential selection biases introduced by the LLM-assisted steps. Without these specifics, it is difficult to assess whether the 304 claims and 400 series accurately capture real-world fact-verification scenarios involving temporal evidence.

    Authors: We acknowledge that the manuscript currently provides limited information on these aspects. The revised Dataset Curation section will specify the claim selection criteria applied to the 41 fact-checking organizations, detail the time-series curation methodology across domains, and discuss potential selection biases, including those arising from LLM-assisted steps. This expanded description will allow readers to evaluate the benchmark's representativeness. revision: yes

Circularity Check

0 steps flagged

No circularity in dataset creation or baseline evaluation

full rationale

The paper introduces TSVer as a benchmark by sourcing 304 real-world claims from external fact-checking organizations and curating 400 time series from diverse domains. Annotations for time frames, verdicts, and justifications are produced via an LLM-assisted pipeline with reported IAA (kappa=0.77), followed by standard baseline evaluations of models such as Gemini-2.5-Pro. No equations, fitted parameters, predictions, or first-principles derivations appear; the central claims rest on external data sources and empirical measurement rather than any self-definitional reduction, fitted-input renaming, or self-citation chain. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on domain assumptions about representativeness of selected claims and time series rather than free parameters or new invented entities; no mathematical derivations are involved.

axioms (1)
  • domain assumption: Real-world claims from fact-checking organizations can be reliably paired with and verified against curated time-series evidence.
    This premise underpins the construction of the 304 claims and 400 time series in the benchmark.

pith-pipeline@v0.9.0 · 5741 in / 1343 out tokens · 36852 ms · 2026-05-18T00:52:28.508600+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 304 real-world claims ... and a curated database of 400 time series

  • IndisputableMonolith/Foundation/Atomicity.lean · atomic_tick · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.77 on verdicts.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 10 internal anchors

  1. Zoë Adams, Magda Osman, Christos Bechlivanidis, and Björn Meder. 2023. https://doi.org/10.1177/17456916221141344 (Why) is misinformation a problem? Perspectives on Psychological Science, 18(6):1436–1463

  2. Mubashara Akhtar, Michael Schlichtkrull, and Andreas Vlachos. 2024. https://doi.org/10.48550/arxiv.2411.05375 Ev2R: Evaluating evidence retrieval in automated fact-checking. arXiv

  3. Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil, Oana Cocarascu, and Elena Simperl. 2023a. https://doi.org/10.48550/arxiv.2311.02216 Exploring the numerical reasoning capabilities of language models: A comprehensive analysis on tabular data. arXiv

  4. Mubashara Akhtar, Nikesh Subedi, Vivek Gupta, Sahar Tahmasebi, Oana Cocarascu, and Elena Simperl. 2023b. https://doi.org/10.48550/arxiv.2311.07453 ChartCheck: Explainable fact-checking over real-world chart images. arXiv

  5. Firoj Alam, Stefano Cresci, Tanmoy Chakraborty, Fabrizio Silvestri, Dimiter Dimitrov, Giovanni Da San Martino, Shaden Shaar, Hamed Firooz, and Preslav Nakov. 2021. https://doi.org/10.48550/arxiv.2103.12541 A survey on multimodal disinformation detection. arXiv

  6. Liesbeth Allein, Isabelle Augenstein, and Marie-Francine Moens. 2020. https://doi.org/10.48550/arxiv.2009.06402 Time-aware evidence ranking for fact-checking. arXiv, 71:100663

  7. Liesbeth Allein, Marlon Saelens, Ruben Cartuyvels, and Marie-Francine Moens. 2023. https://doi.org/10.48550/arxiv.2302.12569 Implicit temporal reasoning for evidence-based fact-checking. arXiv

  8. Rami Aly, Zhijiang Guo, Michael Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. https://doi.org/10.48550/arxiv.2106.05707 FEVEROUS: Fact extraction and VERification over unstructured and structured information. arXiv. FEVEROUS

  9. Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  10. Anthropic. 2025. Claude 3.7 sonnet [large language model]. https://www.anthropic.com/claude/sonnet. Accessed: 2025-05-20

  11. Phoebe Arnold. 2020. The challenges of online fact checking. Technical report, Full Fact

  12. Abolfazl Asudeh, H V Jagadish, You (Will) Wu, and Cong Yu. 2020. https://doi.org/10.14778/3380750.3380762 On detecting cherry-picked trendlines. Proceedings of the VLDB Endowment, 13(6):939–952

  13. Satanjeev Banerjee and Alon Lavie. 2005. https://aclanthology.org/W05-0909 METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Comput...

  14. Anab Maulana Barik, Wynne Hsu, and Mong Li Lee. 2024a. https://doi.org/10.48550/arxiv.2410.14964 ChronoFact: Timeline-based temporal fact verification. arXiv. ChronoClaims

  15. Anab Maulana Barik, Wynne Hsu, and Mong-Li Lee. 2024b. https://doi.org/10.18653/v1/2024.emnlp-industry.48 Time matters: An end-to-end solution for temporal claim verification. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 657–664. T-FEVER. T-FEVEROUS

  16. Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. https://doi.org/10.48550/arxiv.2209.03143 AudioLM: a language modeling approach to audio generation. arXiv

  17. Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. https://doi.org/10.48550/arxiv.2303.12712 Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv

  18. Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. https://doi.org/10.48550/arxiv.1909.02164 TabFact: A large-scale dataset for table-based fact verification. arXiv

  19. Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, and Ion Stoica. 2024. https://doi.org/10.48550/arxiv.2403.04132 Chatbot arena: An open platform for evaluating LLMs by human preference. arXiv

  20. Eun Cheol Choi and Emilio Ferrara. 2023. https://doi.org/10.48550/arxiv.2310.09223 Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation. arXiv

  21. Yiqun Duan, Jinzhao Zhou, Zhen Wang, Yu-Kai Wang, and Chin-Teng Lin. 2023. https://doi.org/10.48550/arxiv.2309.14030 DeWave: Discrete EEG waves encoding for brain dynamics to text translation. arXiv

  22. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, and 516 others. 2024. https://doi.org/10.48550/arxiv.2407.21783 The llama 3 ...

  23. Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. 2023. https://doi.org/10.48550/arxiv.2305.18654 Faith and fate: Limits of transformers on compositionality. arXiv

  24. Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, and Svitlana Vyetrenko. 2024. https://doi.org/10.48550/arxiv.2404.16563 Evaluating large language models on time series feature understanding: A comprehensive taxonomy and benchmark. arXiv

  25. Nicolo' Fontana, Francesco Corso, Enrico Zuccolotto, and Francesco Pierri. 2025. https://doi.org/10.48550/arxiv.2503.05565 Evaluating open-source large language models for automated fact-checking. arXiv

  26. Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. 2023. https://doi.org/10.48550/arxiv.2310.07820 Large language models are zero-shot time series forecasters. arXiv

  27. Zihui Gu, Ju Fan, Nan Tang, Preslav Nakov, Xiaoman Zhao, and Xiaoyong Du. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.331 PASTA: Table-operations aware fact verification via sentence-table cloze pre-training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4971–4983, Abu Dhabi, United Arab Emirates...

  28. Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. https://arxiv.org/abs/2108.11896 A survey on automated fact-checking

  29. Naeemul Hassan, Fatma Arslan, Chengkai Li, and Mark Tremayne. 2017. Toward automated fact-checking: Detecting check-worthy factual claims by claimbuster. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1803–1812

  30. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. https://doi.org/10.48550/arxiv.1904.09751 The curious case of neural text degeneration. arXiv. Nucleus Sampling

  31. Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. https://doi.org/10.5281/zenodo.1212303 spaCy: Industrial-strength Natural Language Processing in Python

  32. Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://doi.org/10.48550/arxi...

  33. Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. 2023. https://doi.org/10.48550/arxiv.2310.01728 Time-LLM: Time series forecasting by reprogramming large language models. arXiv

  34. Neema Kotonya and Francesca Toni. 2020. https://doi.org/10.48550/arxiv.2011.03870 Explainable automated fact-checking: A survey. arXiv

  35. Miaoran Li, Baolin Peng, Michel Galley, Jianfeng Gao, and Zhu Zhang. 2023. https://doi.org/10.48550/arxiv.2305.14623 Self-checker: Plug-and-play modules for fact-checking with large language models. arXiv

  36. Xinyuan Lu, Liangming Pan, Qian Liu, Preslav Nakov, and Min-Yen Kan. 2023. https://doi.org/10.48550/arxiv.2305.13186 SCITAB: A challenging benchmark for compositional reasoning and claim verification on scientific tables. arXiv

  37. Mike A Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, and Tim Althoff. 2024. https://doi.org/10.48550/arxiv.2404.11757 Language models still struggle to zero-shot reason about time series. arXiv

  38. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.741 FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

  39. OpenAI. 2023. https://doi.org/10.48550/arxiv.2303.08774 GPT-4 technical report. arXiv. GPT-4

  40. Nedjma Ousidhoum, Zhangdie Yuan, and Andreas Vlachos. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.163 Varifocal question generation for fact-checking. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2532–2544, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  41. Justus J Randolph. 2005. Free-marginal multirater kappa (multirater k [free]): An alternative to Fleiss' fixed-marginal multirater kappa. Online submission

  42. Daniel Russo, Serra Sinem Tekiroglu, and Marco Guerini. 2023. https://doi.org/10.48550/arxiv.2308.15202 Benchmarking the generation of fact checking explanations. arXiv

  43. Michael Schlichtkrull, Zhijiang Guo, and Andreas Vlachos. 2023. https://doi.org/10.48550/arxiv.2305.13117 AVeriTeC: A dataset for real-world claim verification with evidence from the web. arXiv. AVeriTec

  44. Dimitris Spathis and Fahim Kawsar. 2023. https://doi.org/10.48550/arxiv.2309.06236 The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. arXiv

  45. Marek Strong, Rami Aly, and Andreas Vlachos. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.991 Zero-shot fact verification via natural logic and large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17021–17035, Miami, Florida, USA. Association for Computational Linguistics

  46. Marek Strong, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, and Trevor Wood. 2021. https://doi.org/10.48550/arxiv.2110.12539 Discrete acoustic space for an efficient sampling in neural text-to-speech. arXiv

  47. Jannik Strötgen and Michael Gertz. 2010. https://aclanthology.org/S10-1071 HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 321–324, Uppsala, Sweden. Association for Computational Linguistics

  48. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. https://doi.org/10.18653/v1/N18-1074 FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Pap...

  49. Venktesh V, Abhijit Anand, Avishek Anand, and Vinay Setty. 2024. https://doi.org/10.48550/arxiv.2403.17169 QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims. arXiv

  50. Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, and Marián Šimko. 2024. https://doi.org/10.48550/arxiv.2407.02351 Generative large language models in automated fact-checking: A survey. arXiv

  51. Gengyu Wang, Kate Harwood, Lawrence Chillrud, Amith Ananthram, Melanie Subbiah, and Kathleen McKeown. 2023. https://doi.org/10.48550/arxiv.2305.18265 Check-COVID: Fact-checking COVID-19 news claims with scientific evidence. arXiv

  52. Greta Warren, Irina Shklovski, and Isabelle Augenstein. 2025. https://doi.org/10.48550/arxiv.2502.09083 Show me the work: Fact-checkers' requirements for explainable automated fact-checking. arXiv

  53. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2022. https://doi.org/10.48550/arxiv.2201.11903 Chain of thought prompting elicits reasoning in large language models. arXiv

  54. Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2024. https://doi.org/10.48550/arxiv.2402.02592 Unified training of universal time series forecasting transformers. arXiv

  55. Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. https://doi.org/10.48550/arxiv.2305.11000 SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv

  56. Yilun Zhao, Yitao Long, Yuru Jiang, Chengye Wang, Weiyuan Chen, Hongjun Liu, Yiming Zhang, Xiangru Tang, Chen Zhao, and Arman Cohan. 2024. https://doi.org/10.48550/arxiv.2411.05764 FinDVer: Explainable claim verification over long and hybrid-content financial documents. arXiv

  57. Tian Zhou, PeiSong Niu, Xue Wang, Liang Sun, and Rong Jin. 2023. https://doi.org/10.48550/arxiv.2302.11939 One fits all: Power general time series analysis by pretrained LM. arXiv
