TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3
The pith
The TSVer benchmark shows that even leading AI models struggle to verify claims against time-series evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TSVer demonstrates that fact verification against time-series evidence requires explicit handling of time frames, numerical comparisons, and trend analysis that current models do not perform reliably. The benchmark supplies real claims, structured time series, and multi-step annotations that record which portions of the evidence support each verdict, allowing direct measurement of both verdict correctness and justification quality.
What carries the argument
The TSVer benchmark dataset, consisting of real-world claims annotated with time frames, verdicts, and justifications derived from 400 time series.
If this is right
- Fact verification pipelines will need dedicated modules for aligning claims to specific time intervals and performing numerical operations across series.
- Progress on TSVer should translate to improved handling of statistical claims in real news and public discourse.
- The provided time-frame annotations enable development of models that can cite exact periods rather than whole series.
- High inter-annotator agreement on verdicts supports reliable comparison of future systems against the reported baselines.
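The first and third points above (aligning a claim to a specific time interval, then doing arithmetic on that interval only) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy series, the function names, the verdict labels, and the tolerance rule are hypothetical and are not taken from the TSVer paper or its baseline.

```python
from datetime import date

# Hypothetical toy series: (observation date, value) pairs, sorted by date.
series = [
    (date(2019, 1, 1), 100.0),
    (date(2020, 1, 1), 104.0),
    (date(2021, 1, 1), 99.0),
    (date(2022, 1, 1), 110.0),
    (date(2023, 1, 1), 121.0),
]


def slice_time_frame(points, start, end):
    """Keep only observations whose date falls in the closed interval [start, end]."""
    return [(d, v) for d, v in points if start <= d <= end]


def percent_change(points):
    """Percent change from the first to the last observation of a slice."""
    first, last = points[0][1], points[-1][1]
    return (last - first) / first * 100.0


def check_increase_claim(points, start, end, claimed_pct, tolerance=1.0):
    """Compare a claimed percent increase with the one observed in the time frame.

    The labels and the tolerance (in percentage points) are illustrative
    assumptions, not the benchmark's label set or decision rule.
    """
    window = slice_time_frame(points, start, end)
    if len(window) < 2:
        return "not enough evidence", None
    observed = percent_change(window)
    verdict = "supported" if abs(observed - claimed_pct) <= tolerance else "refuted"
    return verdict, observed


# Claim: "the indicator rose by about 22% between 2021 and 2023".
verdict, observed = check_increase_claim(
    series, date(2021, 1, 1), date(2023, 1, 1), claimed_pct=22.0
)

# The same claimed figure checked over a different time frame (2019 to 2020).
other_verdict, other_observed = check_increase_claim(
    series, date(2019, 1, 1), date(2020, 1, 1), claimed_pct=22.0
)
```

The second call shows why the time frame matters: the same claimed figure is judged differently once the interval changes, which is the behaviour that citing an exact period rather than a whole series is meant to expose.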
Where Pith is reading between the lines
- The benchmark could be extended with additional series from emerging domains such as climate or financial indicators to test generalization.
- Systems that succeed on TSVer may also improve at verifying claims that combine time series with other structured sources like tables or graphs.
- The annotation process itself offers a template for creating similar resources focused on other forms of quantitative evidence.
Load-bearing premise
The 304 selected claims and 400 time series, together with their LLM-assisted annotations, accurately represent the range of real-world fact verification tasks that involve temporal and numerical evidence.
What would settle it
A model that substantially exceeds both 63.57 percent verdict accuracy and an Ev2R score of 47.36 on the full TSVer test set, while using only the provided time-series database and without having seen the annotations during training, would indicate that the claimed difficulty has been overcome.
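As a rough sketch of the verdict half of this test, accuracy can be computed as the percentage of claims whose predicted verdict equals the gold verdict. This assumes plain (micro) accuracy; whether the paper averages differently is not stated on this page. The labels and predictions below are made up, and the Ev2R justification score is not reproduced here.

```python
def verdict_accuracy(gold, predicted):
    """Percentage of claims whose predicted verdict equals the gold verdict."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one claim is required")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Reported baseline verdict accuracy (Gemini-2.5-Pro), as quoted above.
REPORTED_BASELINE = 63.57

# Hypothetical gold labels and system predictions for four claims.
gold = ["supported", "refuted", "refuted", "supported"]
predicted = ["supported", "refuted", "supported", "supported"]

score = verdict_accuracy(gold, predicted)
beats_baseline = score > REPORTED_BASELINE
```

A four-claim toy set says nothing about the benchmark itself; a meaningful comparison would need the full test set and the conditions listed above.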
Original abstract
Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 304 real-world claims sourced from 41 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.77 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even the state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.57 accuracy score on verdicts and an Ev2R score of 47.36 on verdict justifications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TSVer, a benchmark dataset for fact verification that emphasizes temporal and numerical reasoning over time-series evidence. It comprises 304 real-world claims drawn from 41 fact-checking organizations paired with a curated collection of 400 time series spanning diverse domains. Each claim receives annotations for relevant time frames across the series, a verdict, and justifications derived via an LLM-assisted multi-step process that yields an inter-annotator agreement of kappa=0.77 on verdicts. Baseline experiments show that even strong reasoning models such as Gemini-2.5-Pro reach only 63.57% verdict accuracy and 47.36 on the Ev2R justification metric, underscoring the difficulty of the task.
Significance. If the annotations are shown to be faithful, TSVer would fill an important gap by supplying structured, real-world time-series evidence for fact verification evaluation, moving beyond synthetic claims or unstructured evidence in prior datasets. The curation of authentic claims and the reported performance shortfalls of current SOTA models provide a concrete, falsifiable testbed that could stimulate targeted advances in temporal reasoning. The use of real claims from multiple organizations and the multi-domain time-series collection are clear strengths.
major comments (2)
- [Annotation Process] Annotation Process section: The LLM-assisted pipeline for producing time-frame, verdict, and justification labels reports only an aggregate kappa=0.77 on verdicts. No per-annotator confusion matrix, no human re-annotation of a held-out subset, and no analysis of cases in which the LLM step changed the original fact-checker verdict are supplied. Because the benchmark's utility as a test of temporal/numerical reasoning rests on these 304 tuples faithfully reflecting how the time series support or refute each claim, this omission is load-bearing for the central contribution.
- [Dataset Curation] Dataset Curation section: The abstract and description provide limited detail on claim selection criteria, the time-series curation process, and potential selection biases introduced by the LLM-assisted steps. Without these specifics, it is difficult to assess whether the 304 claims and 400 series accurately capture real-world fact-verification scenarios involving temporal evidence.
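The confusion matrix and agreement statistic requested in the first major comment could be produced along the following lines. This is a sketch under stated assumptions: it uses Cohen's kappa for two annotators, whereas this page does not say which kappa variant the paper reports, and the annotator labels are invented for illustration.

```python
from collections import Counter


def confusion_matrix(a, b, labels):
    """Rows follow annotator A's labels, columns follow annotator B's labels."""
    counts = Counter(zip(a, b))
    return [[counts[(x, y)] for y in labels] for x in labels]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two annotators on eight claims (S = supported, R = refuted).
annotator_a = ["S", "S", "R", "R", "S", "R", "S", "R"]
annotator_b = ["S", "S", "R", "R", "S", "R", "R", "R"]

matrix = confusion_matrix(annotator_a, annotator_b, labels=["S", "R"])
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy case the annotators disagree on one of eight claims, and the off-diagonal cell of the matrix shows the direction of that disagreement, which an aggregate kappa alone hides.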
minor comments (2)
- [Abstract] Abstract: The Ev2R justification metric is referenced without a definition or pointer to its formal description; a one-sentence gloss or section reference would improve immediate readability.
- [Experiments] Experiments section: Baseline model configurations, prompting strategies, and exact input formats for the time-series evidence should be tabulated for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential significance of TSVer as a benchmark for temporal and numerical reasoning in fact verification. We address each major comment below and describe the revisions we will incorporate.
- Referee: [Annotation Process] Annotation Process section: The LLM-assisted pipeline for producing time-frame, verdict, and justification labels reports only an aggregate kappa=0.77 on verdicts. No per-annotator confusion matrix, no human re-annotation of a held-out subset, and no analysis of cases in which the LLM step changed the original fact-checker verdict are supplied. Because the benchmark's utility as a test of temporal/numerical reasoning rests on these 304 tuples faithfully reflecting how the time series support or refute each claim, this omission is load-bearing for the central contribution.
  Authors: We agree that the current description of the annotation process is insufficiently detailed. In the revised manuscript we will add a per-annotator confusion matrix for verdicts, describe the human re-annotation performed on a held-out subset, and include an analysis of cases where the LLM-assisted step produced a verdict different from the original fact-checker label. These additions will directly address concerns about annotation faithfulness.
  Revision: yes
- Referee: [Dataset Curation] Dataset Curation section: The abstract and description provide limited detail on claim selection criteria, the time-series curation process, and potential selection biases introduced by the LLM-assisted steps. Without these specifics, it is difficult to assess whether the 304 claims and 400 series accurately capture real-world fact-verification scenarios involving temporal evidence.
  Authors: We acknowledge that the manuscript currently provides limited information on these aspects. The revised Dataset Curation section will specify the claim selection criteria applied to the 41 fact-checking organizations, detail the time-series curation methodology across domains, and discuss potential selection biases, including those arising from LLM-assisted steps. This expanded description will allow readers to evaluate the benchmark's representativeness.
  Revision: yes
Circularity Check
No circularity in dataset creation or baseline evaluation
The paper introduces TSVer as a benchmark by sourcing 304 real-world claims from external fact-checking organizations and curating 400 time series from diverse domains. Annotations for time frames, verdicts, and justifications are produced via an LLM-assisted pipeline with reported IAA (kappa=0.77), followed by standard baseline evaluations of models such as Gemini-2.5-Pro. No equations, fitted parameters, predictions, or first-principles derivations appear; the central claims rest on external data sources and empirical measurement rather than any self-definitional reduction, fitted-input renaming, or self-citation chain. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real-world claims from fact-checking organizations can be reliably paired with and verified against curated time-series evidence.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "We introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 304 real-world claims ... and a curated database of 400 time series"
- IndisputableMonolith/Foundation/Atomicity.lean · atomic_tick · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.77 on verdicts."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.