pith · machine review for the scientific record

arXiv: 2604.23051 · v1 · submitted 2026-04-24 · 💻 cs.CL


Evaluating Temporal Consistency in Multi-Turn Language Models


Pith reviewed 2026-05-08 11:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords temporal consistency · multi-turn dialogue · language models · benchmark · temporal scope · Wikidata · conversational AI · factuality

The pith

Language models frequently violate temporal scope stability in multi-turn conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models must keep track of time-related facts established early in a conversation when users ask follow-up questions without repeating the time references. This paper creates ChronoScope, a benchmark with over one million question chains from Wikidata, to test whether models can carry over, switch, or transfer temporal scopes across turns. Evaluations of current models show they often drift toward assuming the present day despite having the right underlying facts. These problems grow worse with longer conversations and remain even when models are given complete context about prior turns.

Core claim

We introduce ChronoScope, a large-scale benchmark of deterministically generated Wikidata question chains, and find that temporal scope stability—the ability to preserve, override, or transfer time-scoped factual context across dialogue turns—is frequently violated by state-of-the-art language models in controlled multi-turn settings, with models drifting toward present-day assumptions despite correct underlying knowledge; these failures intensify with interaction length and persist even under oracle context conditions.

What carries the argument

ChronoScope benchmark of over one million deterministically generated Wikidata question chains that isolate implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories in multi-turn interactions.
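
To make those four categories concrete, here is a minimal illustrative sketch of how such a chain might be represented, with one toy example per category. The field names, entities, question wording, and gold answers are assumptions for exposition; they are not taken from the released ChronoScope format.

```python
# Illustrative sketch only: field names, entities, question wording, and gold
# answers are assumed for exposition, not the released ChronoScope data format.
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str        # surface question shown to the model
    temporal_scope: str  # scope the answer is expected to respect
    gold_answer: str     # reference answer under that scope


@dataclass
class Chain:
    category: str        # one of the four interaction categories
    turns: List[Turn]


# Implicit carryover: turn 2 omits the date but must inherit "2009".
implicit = Chain("implicit_carryover", [
    Turn("Who was the CEO of Company X in 2009?", "2009", "Person A"),
    Turn("In which city was the company headquartered?", "2009", "City as of 2009"),
])

# Explicit scope switching: turn 2 overrides the earlier scope with a new one.
switching = Chain("explicit_scope_switching", [
    Turn("Who was the CEO of Company X in 2009?", "2009", "Person A"),
    Turn("And who held that role in 2021?", "2021", "Person B"),
])

# Cross-entity transfer: the inherited scope must survive a change of entity.
transfer = Chain("cross_entity_transfer", [
    Turn("Who was the CEO of Company X in 2009?", "2009", "Person A"),
    Turn("Who led Company Y at that time?", "2009", "Person C"),
])

# Longer temporal trajectory: the scope must persist across several
# scope-free follow-ups.
trajectory = Chain("long_trajectory", [
    Turn("Who was the CEO of Company X in 2009?", "2009", "Person A"),
    Turn("Of which country were they a citizen?", "2009", "Country as of 2009"),
    Turn("And what was the company's parent organization?", "2009", "Parent as of 2009"),
])
```

Present-day drift, in these terms, would be answering the second turn of the first chain with the company's current headquarters rather than its 2009 headquarters.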

If this is right

  • Single-turn factual accuracy does not ensure coherent temporal reasoning once questions become sequential.
  • Temporal scope failures become more severe as the number of dialogue turns grows.
  • The problem continues even when models receive complete oracle context for the full conversation history (see the sketch after this list for what such a prompt could look like).
  • Current models exhibit a measurable gap between isolated fact retrieval and maintaining consistent time assumptions in interactive use.
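
The oracle-context condition in the third bullet deserves an operational reading. Below is a minimal sketch, under assumed prompt wording, of an evaluation prompt that replays every prior turn verbatim, so the original time reference is fully available when the follow-up is asked; the reported finding is that present-day drift persists even then. This is not the paper's actual harness.

```python
# Sketch of an oracle-context prompt: every prior turn is restated verbatim,
# so the original time reference is fully available at the follow-up.
# The wording is assumed; it is not the paper's actual evaluation harness.
def build_oracle_prompt(history, follow_up):
    lines = []
    for i, (question, answer) in enumerate(history, start=1):
        lines.append(f"Turn {i} question: {question}")
        lines.append(f"Turn {i} answer: {answer}")
    lines.append(f"Current question: {follow_up}")
    lines.append("Answer with respect to the temporal scope established earlier.")
    return "\n".join(lines)


history = [("Who was the CEO of Company X in 2009?", "Person A")]
print(build_oracle_prompt(history, "In which city was the company headquartered?"))
```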

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications such as timeline-based assistants or historical query systems may need dedicated temporal tracking components to prevent inconsistent answers over extended sessions.
  • The same benchmark structure could be adapted to measure consistency in other dimensions, such as location or entity state across turns.
  • Training methods that explicitly reward preservation of earlier temporal assumptions might reduce the drift observed in longer chains (a sketch of one such reward term follows this list).
  • Testing the benchmark on real user-generated multi-turn dialogues would show whether the controlled failures appear in natural settings.
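
For the training suggestion above, one hedged editorial sketch of a temporal-consistency reward term: score the model on whether its answer under an implicit, carried-over scope agrees with its own answer when that scope is restated explicitly. The answers_match comparator and the averaging scheme are assumptions, not anything proposed in the paper.

```python
# Editorial sketch of a temporal-consistency reward, not a method from the
# paper. answers_match is a hypothetical comparator (exact match, an alias
# table, or an NLI model could all serve); the averaging is arbitrary.
def temporal_consistency_reward(implicit_answer, explicit_answer, answers_match):
    """1.0 if the answer under a carried-over (implicit) scope agrees with the
    answer when the same scope is restated explicitly, else 0.0."""
    return 1.0 if answers_match(implicit_answer, explicit_answer) else 0.0


def chain_reward(per_turn_pairs, answers_match):
    """Average the per-turn consistency rewards over one chain so that longer
    chains are not penalized simply for having more turns."""
    scores = [temporal_consistency_reward(a, b, answers_match)
              for a, b in per_turn_pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage with exact string matching as the comparator.
pairs = [("Person A", "Person A"), ("City as of 2009", "Current city")]
print(chain_reward(pairs, lambda x, y: x == y))  # 0.5: one turn drifted
```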

Load-bearing premise

The deterministically generated Wikidata question chains isolate temporal scope behavior, so that confounds such as phrasing or entity-linking artifacts are not what drives the observed drift.

What would settle it

A replication in which models maintain correct temporal scopes across long interaction trajectories in the ChronoScope benchmark, showing no increase in present-day drift, would falsify the reported frequent violations.
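
A minimal sketch of how that replication could be scored, assuming per-turn records of whether each answer drifted to a present-day reading. The record format and the choice of a Spearman rank correlation as the trend statistic are assumptions, not the paper's protocol.

```python
# Sketch of the falsification check: if present-day drift does not increase
# with turn index, the reported trend is not reproduced. The record format and
# the Spearman trend test are assumptions, not the paper's protocol.
from collections import defaultdict

from scipy.stats import spearmanr


def drift_rate_by_turn(records):
    """records: iterable of (turn_index, drifted_to_present: bool)."""
    counts = defaultdict(lambda: [0, 0])  # turn index -> [drifted, total]
    for turn, drifted in records:
        counts[turn][0] += int(drifted)
        counts[turn][1] += 1
    return {t: d / n for t, (d, n) in sorted(counts.items())}


def drift_increases_with_length(records, alpha=0.05):
    """True if the per-turn drift rate rises significantly with turn index."""
    rates = drift_rate_by_turn(records)
    rho, p_value = spearmanr(list(rates.keys()), list(rates.values()))
    return rho > 0 and p_value < alpha
```

A replication in which drift_increases_with_length comes back False across models and chain categories would undercut the central trend claim.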

Figures

Figures reproduced from arXiv: 2604.23051 by Steven L. Johnson, Tom Hartvigsen, Yash Kumar Atri.

Figure 1. Illustration of temporal scope drift in multi-turn interactions.
Original abstract

Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChronoScope, a benchmark of over one million deterministically generated Wikidata-grounded multi-turn question chains, to evaluate temporal scope stability in language models. It claims that state-of-the-art models frequently violate this stability by drifting toward present-day assumptions in controlled multi-turn settings, even with correct underlying knowledge and oracle context, with failures intensifying as interaction length increases.

Significance. If the benchmark cleanly isolates temporal scope behavior, the findings would identify a meaningful gap between single-turn factual accuracy and coherent multi-turn temporal reasoning, with direct relevance to interactive applications. The public release of the dataset and evaluation suite is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract and ChronoScope construction] The central claim that models violate temporal scope stability (and that failures persist under oracle context) requires that the >1M question chains isolate implicit carryover, scope switching, and cross-entity transfer without confounding factors. The deterministic generation procedure (omitting explicit time references while constructing follow-ups) is not shown to rule out present-tense phrasing artifacts or Wikidata entity-linking inconsistencies that could produce the observed drift; this is load-bearing for interpreting the results as reasoning failures rather than benchmark biases.
  2. [Evaluation and results sections] The manuscript reports extensive evaluation and public release but provides no tables, per-model breakdowns, error analysis, or quantitative trends with interaction length in the reviewed materials. Without these, the magnitude, consistency, and statistical significance of the reported violations cannot be verified.
minor comments (2)
  1. [Abstract] The abstract states that the benchmark spans 'implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories' but does not define these categories with example chains or metrics.
  2. [Public release statement] Ensure the released GitHub repository includes the exact Wikidata query templates and chain-generation code so that the deterministic construction can be inspected for phrasing biases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of benchmark validity and result presentation. We address each major point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract and ChronoScope construction: the central claim that models violate temporal scope stability (and that failures persist under oracle context) requires that the >1M question chains isolate implicit carryover, scope switching, and cross-entity transfer without confounding factors. The deterministic generation procedure (omitting explicit time references while constructing follow-ups) is not shown to rule out present-tense phrasing artifacts or Wikidata entity-linking inconsistencies that could produce the observed drift; this is load-bearing for interpreting the results as reasoning failures rather than benchmark biases.

    Authors: The deterministic generation procedure isolates temporal scope behavior by construction: all chains are derived from Wikidata triples with explicit temporal annotations (e.g., start/end dates), using fixed neutral templates that contain no tense markers or present-day defaults. Entity references rely on stable Wikidata QIDs from a single snapshot, eliminating linking inconsistencies. We have performed internal validation (manual review of 500 chains plus automated checks for phrasing patterns) confirming no artifacts. We will add a dedicated subsection with generation pseudocode, example chains, and validation statistics to make this explicit. revision: partial

  2. Referee: Evaluation and results sections: the manuscript reports extensive evaluation and public release but provides no tables, per-model breakdowns, error analysis, or quantitative trends with interaction length in the reviewed materials. Without these, the magnitude, consistency, and statistical significance of the reported violations cannot be verified.

    Authors: The full evaluation suite includes per-model tables, categorized error analysis (e.g., present-drift vs. carryover failure), and plots of violation rates versus turn count with significance testing. These appear to have been omitted from the reviewed materials. We will incorporate the key tables, breakdowns, and trend figures directly into the main text of the revised manuscript. revision: yes
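
On the first response, a hedged editorial sketch of what deterministic, template-based chain generation from a temporally annotated triple could look like. The triple fields, template strings, and placeholder identifier are assumptions for illustration; the released repository, not this sketch, is the reference for the actual procedure.

```python
# Illustrative sketch of deterministic chain generation from a temporally
# annotated triple. Field names, templates, and the placeholder QID are
# assumed; they are not taken from the ChronoScope repository.
ANCHOR_TEMPLATE = "Who held the position of {role} at {entity} in {year}?"
FOLLOWUP_TEMPLATE = "What was the {attribute} of {entity}?"  # no time reference


def generate_chain(triple):
    """Generation is a pure function of the input triple, so the same Wikidata
    snapshot always yields identical chains (the determinism claim)."""
    anchor = ANCHOR_TEMPLATE.format(role=triple["role"],
                                    entity=triple["entity_label"],
                                    year=triple["year"])
    followup = FOLLOWUP_TEMPLATE.format(attribute=triple["attribute"],
                                        entity=triple["entity_label"])
    return {
        "entity_qid": triple["entity_qid"],  # stable Wikidata identifier
        "scope": triple["year"],             # scope the follow-up must inherit
        "turns": [anchor, followup],
        "gold": [triple["value_at_year"], triple["attribute_at_year"]],
    }


example = generate_chain({
    "entity_qid": "Q0000000",  # placeholder, not a real QID
    "entity_label": "Company X",
    "role": "chief executive officer",
    "attribute": "headquarters location",
    "year": 2009,
    "value_at_year": "Person A",
    "attribute_at_year": "City as of 2009",
})
```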

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with direct measurements

full rationale

This is an empirical benchmark paper that introduces ChronoScope, a collection of over one million deterministically generated Wikidata question chains, and reports model evaluation results on temporal scope stability. There are no derivations, equations, fitted parameters, predictions, or self-citations that reduce any central claim to its own inputs by construction. All findings rest on direct measurement against the generated data rather than self-referential logic, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that Wikidata-grounded deterministic chains isolate temporal scope without confounding variables; no free parameters or invented physical entities are used.

axioms (1)
  • domain assumption: Deterministic generation from Wikidata produces question chains that cleanly isolate temporal scope behavior.
    Invoked in the benchmark construction, as implied by the abstract.
invented entities (1)
  • ChronoScope benchmark (no independent evidence)
    purpose: Diagnostic tool for temporal scope stability
    New dataset introduced by the authors with no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5509 in / 1177 out tokens · 53111 ms · 2026-05-08T11:32:46.501890+00:00 · methodology

