pith. sign in

arxiv: 2510.23853 · v3 · submitted 2025-10-27 · 💻 cs.CL

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Pith reviewed 2026-05-18 03:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords temporal blindnessLLM agentstool usehuman temporal perceptionTicToc datasetelapsed timemulti-turn interactionsalignment
0
0 comments X

The pith

LLM agents are temporally blind and align with human preferences on tool calls at rates no better than 65 percent even when given timestamps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLM agents default to assuming a stationary context and therefore do not factor in real-world time elapsed between messages when deciding whether to invoke tools. To measure the resulting misalignment, the authors created the TicToc dataset of 76 multi-turn scenarios spanning high, medium, and low time sensitivity and gathered human preference labels for calling a tool versus answering directly. Existing models were then tested on these scenarios with and without explicit time information, revealing that no model exceeds a normalized alignment rate of 65 percent. The work further shows that simple prompting strategies offer little improvement for most models while certain post-training methods can raise alignment. This matters for any deployment in which agents must act on changing information over time rather than on a frozen snapshot.

Core claim

LLM agents suffer from temporal blindness because they treat context as stationary and therefore fail to incorporate elapsed real-world time when choosing between tool invocation and direct answers; human preference data on the TicToc dataset of 76 scenarios shows that current models achieve at most 65 percent normalized alignment even when timestamps are supplied, though targeted post-training can improve this match.

What carries the argument

The TicToc dataset of 76 multi-turn trajectories with human preference labels for tool calling versus direct answering under varying elapsed times.

If this is right

  • Agents that ignore elapsed time will either act on stale information and skip needed tool calls or waste effort by repeating calls unnecessarily.
  • Supplying timestamps in prompts produces only limited gains in decision quality for most existing models.
  • Post-training on time-aware preference data can raise alignment between multi-turn tool-use decisions and human temporal expectations.
  • Dynamic environments require agents that treat time progression as an explicit input rather than an implicit afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world deployments in monitoring, personal assistance, or live customer service may need explicit time-modeling layers beyond current post-training fixes.
  • The misalignment could worsen over longer sessions, suggesting tests on extended conversation lengths to measure compounding errors.
  • Architectural additions such as internal clock states or time embeddings might be explored as alternatives to preference tuning alone.

Load-bearing premise

The human preference labels collected on the TicToc dataset of 76 scenarios constitute a reliable and representative ground truth for how humans expect agents to handle elapsed time in tool-use decisions.

What would settle it

New human labels collected on a fresh set of time-sensitive scenarios where a model given only timestamps reaches over 80 percent alignment with those labels.

Figures

Figures reproduced from arXiv: 2510.23853 by Arshia Soltani Moakhar, Chenrui Fan, Kazem Faghih, Parsa Hosseini, Soheil Feizi, Wenxiao Wang, Yize Cheng, Zahra Sodagar.

Figure 1
Figure 1. Figure 1: Illustrative examples showing the liability of temporal blindness in multi-turn LLM agents, in comparison to human. The first row shows the case of over-reliance, where the model displays over-confidence in outdated context, resulting in erroneous outputs. The second row shows the case of under-reliance, where the model displays excessive caution through repeated tool calls, resulting in unnecessary delays… view at source ↗
Figure 2
Figure 2. Figure 2: Normalized preference alignment rate of all models with and without timestamps. Without timestamps, models perform only slightly above random (max alignment marginally exceed￾ing 60%). With timestamps, larger commercial models improve modestly, peaking just over 65%. It can be seen that without temporal information, most models perform only slightly better than random guessing, with the highest normalized … view at source ↗
Figure 3
Figure 3. Figure 3: Model attempt rates for both prefer-Tool and prefer-noTool cases. Without timestamps, models show varying tool-use biases; with timestamps, attempt rates rise for both classes, indicating limited alignment with human-like temporal awareness. further indicates that current models struggle to effectively exploit temporal information and fail to align their tool-use decisions with human time perception. 4.2 T… view at source ↗
Figure 4
Figure 4. Figure 4: Normalized prefer￾ence alignment rate of Qwen￾3-8B with and without long CoT reasoning, under set￾tings of both with and with￾out timestamp. Long CoT reasoning shows no improve￾ment in tool-use alignment with human time perception. benefits in areas such as robustness and security [42]. In this section, we investigate whether long CoT reasoning can enhance models’ ability to leverage temporal information a… view at source ↗
Figure 5
Figure 5. Figure 5: Prompt used to explicitly align models’ tool-use decisions with human expectations. Few-shot example rules are used to illustrate when tool calls are appropriate or unnecessary depending on elapsed time, providing models with explicit guidance. Note that the scenarios mentioned in the prompt do not overlap with our coverage in TicToc-v1. Ministral-8B Llama-3.1-8B Qwen-2.5-7B Llama-3.2-3B gpt-4.1-nano gpt-4… view at source ↗
Figure 6
Figure 6. Figure 6: Normalized preference alignment rate of all models with and without prompting-based alignment efforts. Detailed prompting with few-shot example rules yields notable boost in alignment rate for advanced reasoning models (o3, o4-mini), but has little effect on most others. tool-use decisions in temporally dynamic environments requires targeted post-training rather than prompt engineering alone. 5 Conclusion,… view at source ↗
Figure 7
Figure 7. Figure 7: The prompt used for generating read-only samples with an in-context example. To fully leverage the instruction following capability of the model, we generate the samples for one variant at a time by passing in a preferred_strategy. For verification, we employed GPT-4.1 as an initial LLM-as-judge filter prior to human inspection. The prompt specifying the verification rules provided to the judge model is sh… view at source ↗
Figure 8
Figure 8. Figure 8: The prompt used for generating read+write samples with an in-context example. To fully leverage the instruction following capability of the model, we generate the samples for one variant at a time by passing in a preferred_strategy. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The prompt we use to verify samples with LLM judge. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example trajectories with timestamps for both [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
read the original abstract

Large language model (LLM) agents are increasingly used to interact with and execute tasks in dynamic environments. However, a critical yet overlooked limitation of these agents is that they, by default, assume a stationary context, failing to account for the real-world time elapsed between messages. We refer to this as "temporal blindness". This limitation hinders decisions about when to invoke tools, leading agents to either over-rely on stale context and skip needed tool calls, or under-rely on it and redundantly repeat tool calls. To study this challenge, we constructed TicToc, a diverse dataset of multi-turn user-agent message trajectories across 76 scenarios, spanning dynamic environments with high, medium, and low time sensitivity. We collected human preferences between "calling a tool" and "directly answering" on each sample, and evaluated how well LLM tool-calling decisions align with human preferences under varying amounts of elapsed time. Our analysis reveals that existing models display poor alignment with human temporal perception, with no model achieving a normalized alignment rate better than 65% when given time stamp information. We also show that naive, prompt-based alignment techniques have limited effectiveness for most models, but specific post-training alignment can be a viable way to align multi-turn LLM tool use with human temporal perception. Our data and findings provide a first step toward understanding and mitigating temporal blindness, offering insights to foster the development of more time-aware and human-aligned agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM agents exhibit 'temporal blindness' by defaulting to stationary context and failing to account for elapsed real-world time in multi-turn tool-use decisions. To quantify this, the authors introduce the TicToc dataset of 76 scenarios spanning high/medium/low time sensitivity, collect human preference labels for tool-calling versus direct answering, and evaluate model alignment with these labels (with and without explicit timestamps). They report that no evaluated model exceeds 65% normalized alignment and that prompt-based interventions are largely ineffective while post-training alignment offers a viable path forward.

Significance. If the human preference labels prove reliable, the work identifies a concrete, practically relevant gap in current agent tool-use behavior and supplies an initial benchmark (TicToc) plus dataset for future time-aware systems. The empirical framing and release of the preference data are constructive contributions that could support reproducible follow-up work on temporal reasoning in agents.

major comments (2)
  1. [Methodology] Methodology / data collection section: the central 65% normalized alignment ceiling rests on human preference labels collected over the 76 TicToc scenarios, yet no inter-annotator agreement, number of annotators, annotator demographics, or consistency checks are reported. Without these, it is impossible to determine whether the measured misalignment reflects model limitations or properties of the label collection process.
  2. [Results] Results section: the paper states a 'normalized alignment rate' but provides no definition, no per-scenario breakdown, no statistical significance tests, and no confidence intervals around the 65% figure. These omissions make it difficult to assess the robustness of the claim that no model exceeds this threshold.
minor comments (2)
  1. [Experimental setup] Clarify the exact procedure for injecting timestamps into the model prompts and how 'elapsed time' is operationalized across the high/medium/low sensitivity categories.
  2. [Introduction] The abstract and introduction introduce the term 'temporal blindness' without a precise operational definition; a short formalization would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Methodology] Methodology / data collection section: the central 65% normalized alignment ceiling rests on human preference labels collected over the 76 TicToc scenarios, yet no inter-annotator agreement, number of annotators, annotator demographics, or consistency checks are reported. Without these, it is impossible to determine whether the measured misalignment reflects model limitations or properties of the label collection process.

    Authors: We agree that these details are necessary to establish the reliability of the human preference labels. The original submission omitted a full description of the annotation protocol. In the revised manuscript we will expand the methodology section with the number of annotators, their relevant demographics and qualifications, inter-annotator agreement statistics, and the consistency checks that were performed. These additions will allow readers to better evaluate whether the reported misalignment stems from model behavior rather than label variability. revision: yes

  2. Referee: [Results] Results section: the paper states a 'normalized alignment rate' but provides no definition, no per-scenario breakdown, no statistical significance tests, and no confidence intervals around the 65% figure. These omissions make it difficult to assess the robustness of the claim that no model exceeds this threshold.

    Authors: We accept that the results presentation requires greater transparency. We will add an explicit definition of the normalized alignment rate, include a per-scenario breakdown (in an appendix or supplementary table), report statistical significance tests for model comparisons, and provide bootstrap-derived confidence intervals around the key figures including the 65% ceiling. These changes will strengthen the empirical support for our conclusions while leaving the overall findings unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark against external human labels

full rationale

The paper constructs the TicToc dataset of 76 scenarios, collects independent human preference labels on tool-calling vs. direct answering under elapsed time, and reports model alignment rates against those labels. No equations, fitted parameters, derivations, or self-citation chains are used to generate the central 65% alignment figure; the result is a direct comparison to externally collected judgments rather than a quantity forced by the models' own outputs or prior author work. The methodology is self-contained against the human benchmark and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that collected human preferences are a valid proxy for correct time-aware behavior and that the 76 scenarios adequately sample dynamic environments.

axioms (1)
  • domain assumption Human preferences between tool calling and direct answering on TicToc samples accurately reflect desirable agent behavior under elapsed time
    The entire evaluation framework treats these preferences as ground truth against which models are scored.
invented entities (1)
  • temporal blindness no independent evidence
    purpose: Conceptual label for the failure to account for real-world time elapsed between messages
    New term introduced to name the observed limitation; no independent empirical handle provided beyond the benchmark itself.

pith-pipeline@v0.9.0 · 5822 in / 1221 out tokens · 32960 ms · 2026-05-18T03:43:00.174363+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 7.0

    Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.

  2. Evaluating Temporal Consistency in Multi-Turn Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.

  3. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 2 Pith papers · 10 internal anchors

  1. [1]

    A. Abid, M. Farooqi, and J. Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, page 298–306, New York, NY , USA, 2021. Association for Computing Machinery

  2. [2]

    Introducing the model context protocol, November 2024

    Anthropic. Introducing the model context protocol, November 2024

  3. [3]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan....

  4. [4]

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries, 2024

  5. [5]

    Z. Chu, J. Chen, Q. Chen, W. Yu, H. Wang, M. Liu, and B. Qin. Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models, 2024

  6. [6]

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

  7. [7]

    Gaming tool preferences in agentic llms

    K. Faghih, W. Wang, Y . Cheng, S. Bharti, G. Sriramanan, S. Balasubramanian, P. Hosseini, and S. Feizi. Gaming tool preferences in agentic llms.arXiv preprint arXiv:2505.18135, 2025

  8. [8]

    Fatemi, M

    B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, and B. Perozzi. Test of time: A benchmark for evaluating LLMs on temporal reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  9. [9]

    Y . Ge, S. Romeo, J. Cai, R. Shu, M. Sunkara, Y . Benajiba, and Y . Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues, 2025

  10. [10]

    Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

    Google. Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

  11. [11]

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  12. [12]

    Gupta, P

    V . Gupta, P. Kandoi, M. B. V ora, S. Zhang, Y . He, R. Reinanda, and V . Srikumar. Temptabqa: Temporal question answering for semi-structured tables, 2023

  13. [13]

    Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

    Y . Huang, J. Shi, Y . Li, C. Fan, S. Wu, Q. Zhang, Y . Liu, P. Zhou, Y . Wan, N. Z. Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

  14. [14]

    OpenAI o1 System Card

    A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  15. [15]

    M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li. Api-bank: A comprehensive benchmark for tool-augmented llms.arXiv preprint arXiv:2304.08244, 2023

  16. [16]

    Z. Liu, P. Han, H. Yu, H. Li, and J. You. Time-r1: Towards comprehensive temporal reasoning in llms, 2025

  17. [17]

    The llama 3 herd of models, 2024

    Llama Team, AI at Meta. The llama 3 herd of models, 2024. 10

  18. [18]

    Lucy and D

    L. Lucy and D. Bamman. Gender and representation bias in GPT-3 generated stories. In N. Ak- oury, F. Brahman, S. Chaturvedi, E. Clark, M. Iyyer, and L. J. Martin, editors,Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual, June 2021. Association for Computational Linguistics

  19. [19]

    Augmented Language Models: a Survey

    G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

  20. [20]

    Un ministral, des ministraux: Introducing the world’s best edge models, Oct

    Mistral AI Team. Un ministral, des ministraux: Introducing the world’s best edge models, Oct. 2024

  21. [21]

    s1: Simple test-time scaling

    N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

  22. [22]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  23. [23]

    Introducing OpenAI o3 and o4-mini, 2025

    OpenAI. Introducing OpenAI o3 and o4-mini, 2025

  24. [24]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  25. [25]

    Talm: Tool augmente d language models

    A. Parisi, Y . Zhao, and N. Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

  26. [26]

    S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning

  27. [27]

    Y . Qin, K. Song, Y . Hu, W. Yao, S. Cho, X. Wang, X. Wu, F. Liu, P. Liu, and D. Yu. Infobench: Evaluating instruction following ability in large language models, 2024

  28. [28]

    Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

  29. [29]

    Rafailov, A

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

  30. [30]

    H. Ross, A. S. Mahabaleshwarkar, and Y . Suhara. When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

  31. [31]

    J. B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools, 2023

  32. [32]

    Schick, J

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  33. [33]

    T. Shen, R. Jin, Y . Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y . Liu, and D. Xiong. Large language model alignment: A survey.arXiv preprint arXiv:2309.15025, 2023

  34. [34]

    Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023

  35. [35]

    J. Shi, Z. Yuan, G. Tie, P. Zhou, N. Z. Gong, and L. Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

  36. [36]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.URL https://arxiv. org/abs/2303.11366, 1, 2023. 11

  37. [37]

    X. Song, B. Liang, Y . Sun, C. Zhang, B. Wang, and R. Xu. Bridging time gaps: Temporal logic relations for enhancing temporal reasoning in large language models. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’25, page 3040–3044, New York, NY , USA, 2025. Association for Computing Machinery

  38. [38]

    Y . Song, W. Xiong, D. Zhu, W. Wu, H. Qian, M. Song, H. Huang, C. Li, K. Wang, R. Yao, et al. Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

  39. [39]

    Z. Su, J. Zhang, T. Zhu, X. Qu, J. Li, M. Zhang, and Y . Cheng. Timo: Towards better temporal reasoning for language models, 2024

  40. [40]

    S. M. T. I. Tonmoy, S. M. M. Zaman, V . Jain, A. Rani, V . Rawte, A. Chadha, and A. Das. A comprehensive survey of hallucination mitigation techniques in large language models, 2024

  41. [41]

    G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar. V oyager: An open-ended embodied agent with large language models.Transactions on Machine Learning Research

  42. [42]

    W. Wang, P. Hosseini, and S. Feizi. Chain-of-defensive-thought: Structured reasoning elicits robustness in large language models against reference corruption.arXiv preprint arXiv:2504.20769, 2025

  43. [43]

    X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

  44. [44]

    Wang and Y

    Y . Wang and Y . Zhao. Tram: Benchmarking temporal reasoning for large language models, 2024

  45. [45]

    Y . Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu. Aligning large language models with human: A survey, 2023

  46. [46]

    Z. Wang, B. Bi, S. K. Pentyala, K. Ramnath, S. Chaudhuri, S. Mehrotra, Zixu, Zhu, X.-B. Mao, S. Asur, Na, and Cheng. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more, 2024

  47. [47]

    Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, and Y . Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents.arXiv preprint arXiv:2302.01560, 2023

  48. [48]

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

  49. [49]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  50. [50]

    Xiong, A

    S. Xiong, A. Payani, R. Kompella, and F. Fekri. Large language models can learn temporal reasoning, 2024

  51. [51]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  52. [52]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

  53. [53]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

  54. [54]

    C. Yuan, Q. Xie, J. Huang, and S. Ananiadou. Back to the future: Towards explainable temporal reasoning with large language models, 2023. 12

  55. [55]

    Zhang, H

    P. Zhang, H. Zhang, X. Wang, F. Zhang, and F. Yu. A brief survey on temporal reasoning based on large language models. In2024 8th Asian Conference on Artificial Intelligence Technology (ACAIT), pages 7–11, 2024

  56. [56]

    Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

    Y . Zhang, J. Chen, J. Wang, Y . Liu, C. Yang, C. Shi, X. Zhu, Z. Lin, H. Wan, Y . Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

  57. [57]

    try again now

    J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou. Instruction- following evaluation for large language models, 2023. 13 AGPTPrompts Used for Trajectory Creation and Verification To generate trajectories, we first manually authored a single exemplar multi-turn trajectory and used it as an in-context example to prompt GPT-4o to p...