Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Arshia Soltani Moakhar; Chenrui Fan; Kazem Faghih; Parsa Hosseini; Soheil Feizi; Wenxiao Wang; Yize Cheng; Zahra Sodagar

arxiv: 2510.23853 · v3 · submitted 2025-10-27 · 💻 cs.CL

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

Yize Cheng , Arshia Soltani Moakhar , Chenrui Fan , Parsa Hosseini , Kazem Faghih , Zahra Sodagar , Wenxiao Wang , Soheil Feizi This is my paper

Pith reviewed 2026-05-18 03:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords temporal blindnessLLM agentstool usehuman temporal perceptionTicToc datasetelapsed timemulti-turn interactionsalignment

0 comments

The pith

LLM agents are temporally blind and align with human preferences on tool calls at rates no better than 65 percent even when given timestamps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLM agents default to assuming a stationary context and therefore do not factor in real-world time elapsed between messages when deciding whether to invoke tools. To measure the resulting misalignment, the authors created the TicToc dataset of 76 multi-turn scenarios spanning high, medium, and low time sensitivity and gathered human preference labels for calling a tool versus answering directly. Existing models were then tested on these scenarios with and without explicit time information, revealing that no model exceeds a normalized alignment rate of 65 percent. The work further shows that simple prompting strategies offer little improvement for most models while certain post-training methods can raise alignment. This matters for any deployment in which agents must act on changing information over time rather than on a frozen snapshot.

Core claim

LLM agents suffer from temporal blindness because they treat context as stationary and therefore fail to incorporate elapsed real-world time when choosing between tool invocation and direct answers; human preference data on the TicToc dataset of 76 scenarios shows that current models achieve at most 65 percent normalized alignment even when timestamps are supplied, though targeted post-training can improve this match.

What carries the argument

The TicToc dataset of 76 multi-turn trajectories with human preference labels for tool calling versus direct answering under varying elapsed times.

If this is right

Agents that ignore elapsed time will either act on stale information and skip needed tool calls or waste effort by repeating calls unnecessarily.
Supplying timestamps in prompts produces only limited gains in decision quality for most existing models.
Post-training on time-aware preference data can raise alignment between multi-turn tool-use decisions and human temporal expectations.
Dynamic environments require agents that treat time progression as an explicit input rather than an implicit afterthought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world deployments in monitoring, personal assistance, or live customer service may need explicit time-modeling layers beyond current post-training fixes.
The misalignment could worsen over longer sessions, suggesting tests on extended conversation lengths to measure compounding errors.
Architectural additions such as internal clock states or time embeddings might be explored as alternatives to preference tuning alone.

Load-bearing premise

The human preference labels collected on the TicToc dataset of 76 scenarios constitute a reliable and representative ground truth for how humans expect agents to handle elapsed time in tool-use decisions.

What would settle it

New human labels collected on a fresh set of time-sensitive scenarios where a model given only timestamps reaches over 80 percent alignment with those labels.

Figures

Figures reproduced from arXiv: 2510.23853 by Arshia Soltani Moakhar, Chenrui Fan, Kazem Faghih, Parsa Hosseini, Soheil Feizi, Wenxiao Wang, Yize Cheng, Zahra Sodagar.

**Figure 1.** Figure 1: Illustrative examples showing the liability of temporal blindness in multi-turn LLM agents, in comparison to human. The first row shows the case of over-reliance, where the model displays over-confidence in outdated context, resulting in erroneous outputs. The second row shows the case of under-reliance, where the model displays excessive caution through repeated tool calls, resulting in unnecessary delays… view at source ↗

**Figure 2.** Figure 2: Normalized preference alignment rate of all models with and without timestamps. Without timestamps, models perform only slightly above random (max alignment marginally exceeding 60%). With timestamps, larger commercial models improve modestly, peaking just over 65%. It can be seen that without temporal information, most models perform only slightly better than random guessing, with the highest normalized … view at source ↗

**Figure 3.** Figure 3: Model attempt rates for both prefer-Tool and prefer-noTool cases. Without timestamps, models show varying tool-use biases; with timestamps, attempt rates rise for both classes, indicating limited alignment with human-like temporal awareness. further indicates that current models struggle to effectively exploit temporal information and fail to align their tool-use decisions with human time perception. 4.2 T… view at source ↗

**Figure 4.** Figure 4: Normalized preference alignment rate of Qwen3-8B with and without long CoT reasoning, under settings of both with and without timestamp. Long CoT reasoning shows no improvement in tool-use alignment with human time perception. benefits in areas such as robustness and security [42]. In this section, we investigate whether long CoT reasoning can enhance models’ ability to leverage temporal information a… view at source ↗

**Figure 5.** Figure 5: Prompt used to explicitly align models’ tool-use decisions with human expectations. Few-shot example rules are used to illustrate when tool calls are appropriate or unnecessary depending on elapsed time, providing models with explicit guidance. Note that the scenarios mentioned in the prompt do not overlap with our coverage in TicToc-v1. Ministral-8B Llama-3.1-8B Qwen-2.5-7B Llama-3.2-3B gpt-4.1-nano gpt-4… view at source ↗

**Figure 6.** Figure 6: Normalized preference alignment rate of all models with and without prompting-based alignment efforts. Detailed prompting with few-shot example rules yields notable boost in alignment rate for advanced reasoning models (o3, o4-mini), but has little effect on most others. tool-use decisions in temporally dynamic environments requires targeted post-training rather than prompt engineering alone. 5 Conclusion,… view at source ↗

**Figure 7.** Figure 7: The prompt used for generating read-only samples with an in-context example. To fully leverage the instruction following capability of the model, we generate the samples for one variant at a time by passing in a preferred_strategy. For verification, we employed GPT-4.1 as an initial LLM-as-judge filter prior to human inspection. The prompt specifying the verification rules provided to the judge model is sh… view at source ↗

**Figure 8.** Figure 8: The prompt used for generating read+write samples with an in-context example. To fully leverage the instruction following capability of the model, we generate the samples for one variant at a time by passing in a preferred_strategy. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: The prompt we use to verify samples with LLM judge. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Example trajectories with timestamps for both [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

Large language model (LLM) agents are increasingly used to interact with and execute tasks in dynamic environments. However, a critical yet overlooked limitation of these agents is that they, by default, assume a stationary context, failing to account for the real-world time elapsed between messages. We refer to this as "temporal blindness". This limitation hinders decisions about when to invoke tools, leading agents to either over-rely on stale context and skip needed tool calls, or under-rely on it and redundantly repeat tool calls. To study this challenge, we constructed TicToc, a diverse dataset of multi-turn user-agent message trajectories across 76 scenarios, spanning dynamic environments with high, medium, and low time sensitivity. We collected human preferences between "calling a tool" and "directly answering" on each sample, and evaluated how well LLM tool-calling decisions align with human preferences under varying amounts of elapsed time. Our analysis reveals that existing models display poor alignment with human temporal perception, with no model achieving a normalized alignment rate better than 65% when given time stamp information. We also show that naive, prompt-based alignment techniques have limited effectiveness for most models, but specific post-training alignment can be a viable way to align multi-turn LLM tool use with human temporal perception. Our data and findings provide a first step toward understanding and mitigating temporal blindness, offering insights to foster the development of more time-aware and human-aligned agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper flags temporal blindness in LLM agents and backs it with a TicToc benchmark showing at most 65% alignment to human time preferences on tool calls.

read the letter

The main thing your colleague should know is that this paper points out how LLM agents fail to consider elapsed time when making tool calls, leading to either stale answers or unnecessary repeats, and their TicToc benchmark puts a number on it with no model beating 65% alignment to human judgments. They do something useful by building a dataset of 76 scenarios that vary in time sensitivity and gathering human preferences for each one on whether to call a tool or answer directly. Running models with and without time information and comparing to those preferences gives a clear empirical result. It's also helpful that they check simple prompt fixes and find them limited, while suggesting post-training as a better path. The softer part is the human preference data itself. The stress test note is right to flag that without details on annotator agreement or how representative the labels are, the misalignment rate might not be as robust as it looks. Seventy-six scenarios is a reasonable start but leaves room for questions about coverage and consistency. The abstract doesn't give dataset sizes per scenario or significance tests, so the central claim needs more backing from the full methods. This paper is aimed at researchers focused on making LLM agents more reliable in ongoing, real-world interactions. Anyone thinking about multi-turn tool use or context management could get value from the benchmark idea and the initial numbers. I would recommend sending it for peer review. The idea is solid enough and the problem is real, but the evaluation needs scrutiny to confirm the findings hold up.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM agents exhibit 'temporal blindness' by defaulting to stationary context and failing to account for elapsed real-world time in multi-turn tool-use decisions. To quantify this, the authors introduce the TicToc dataset of 76 scenarios spanning high/medium/low time sensitivity, collect human preference labels for tool-calling versus direct answering, and evaluate model alignment with these labels (with and without explicit timestamps). They report that no evaluated model exceeds 65% normalized alignment and that prompt-based interventions are largely ineffective while post-training alignment offers a viable path forward.

Significance. If the human preference labels prove reliable, the work identifies a concrete, practically relevant gap in current agent tool-use behavior and supplies an initial benchmark (TicToc) plus dataset for future time-aware systems. The empirical framing and release of the preference data are constructive contributions that could support reproducible follow-up work on temporal reasoning in agents.

major comments (2)

[Methodology] Methodology / data collection section: the central 65% normalized alignment ceiling rests on human preference labels collected over the 76 TicToc scenarios, yet no inter-annotator agreement, number of annotators, annotator demographics, or consistency checks are reported. Without these, it is impossible to determine whether the measured misalignment reflects model limitations or properties of the label collection process.
[Results] Results section: the paper states a 'normalized alignment rate' but provides no definition, no per-scenario breakdown, no statistical significance tests, and no confidence intervals around the 65% figure. These omissions make it difficult to assess the robustness of the claim that no model exceeds this threshold.

minor comments (2)

[Experimental setup] Clarify the exact procedure for injecting timestamps into the model prompts and how 'elapsed time' is operationalized across the high/medium/low sensitivity categories.
[Introduction] The abstract and introduction introduce the term 'temporal blindness' without a precise operational definition; a short formalization would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses

Referee: [Methodology] Methodology / data collection section: the central 65% normalized alignment ceiling rests on human preference labels collected over the 76 TicToc scenarios, yet no inter-annotator agreement, number of annotators, annotator demographics, or consistency checks are reported. Without these, it is impossible to determine whether the measured misalignment reflects model limitations or properties of the label collection process.

Authors: We agree that these details are necessary to establish the reliability of the human preference labels. The original submission omitted a full description of the annotation protocol. In the revised manuscript we will expand the methodology section with the number of annotators, their relevant demographics and qualifications, inter-annotator agreement statistics, and the consistency checks that were performed. These additions will allow readers to better evaluate whether the reported misalignment stems from model behavior rather than label variability. revision: yes
Referee: [Results] Results section: the paper states a 'normalized alignment rate' but provides no definition, no per-scenario breakdown, no statistical significance tests, and no confidence intervals around the 65% figure. These omissions make it difficult to assess the robustness of the claim that no model exceeds this threshold.

Authors: We accept that the results presentation requires greater transparency. We will add an explicit definition of the normalized alignment rate, include a per-scenario breakdown (in an appendix or supplementary table), report statistical significance tests for model comparisons, and provide bootstrap-derived confidence intervals around the key figures including the 65% ceiling. These changes will strengthen the empirical support for our conclusions while leaving the overall findings unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark against external human labels

full rationale

The paper constructs the TicToc dataset of 76 scenarios, collects independent human preference labels on tool-calling vs. direct answering under elapsed time, and reports model alignment rates against those labels. No equations, fitted parameters, derivations, or self-citation chains are used to generate the central 65% alignment figure; the result is a direct comparison to externally collected judgments rather than a quantity forced by the models' own outputs or prior author work. The methodology is self-contained against the human benchmark and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that collected human preferences are a valid proxy for correct time-aware behavior and that the 76 scenarios adequately sample dynamic environments.

axioms (1)

domain assumption Human preferences between tool calling and direct answering on TicToc samples accurately reflect desirable agent behavior under elapsed time
The entire evaluation framework treats these preferences as ground truth against which models are scored.

invented entities (1)

temporal blindness no independent evidence
purpose: Conceptual label for the failure to account for real-world time elapsed between messages
New term introduced to name the observed limitation; no independent empirical handle provided beyond the benchmark itself.

pith-pipeline@v0.9.0 · 5822 in / 1221 out tokens · 32960 ms · 2026-05-18T03:43:00.174363+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
cs.AI 2026-05 unverdicted novelty 7.0

Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
Evaluating Temporal Consistency in Multi-Turn Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
cs.AI 2026-05 unverdicted novelty 6.0

LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1]

A. Abid, M. Farooqi, and J. Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, page 298–306, New York, NY , USA, 2021. Association for Computing Machinery

work page 2021
[2]

Introducing the model context protocol, November 2024

Anthropic. Introducing the model context protocol, November 2024

work page 2024
[3]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan....

work page 2022
[4]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries, 2024

work page 2024
[5]

Z. Chu, J. Chen, Q. Chen, W. Yu, H. Wang, M. Liu, and B. Qin. Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models, 2024

work page 2024
[6]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

work page 2024
[7]

Gaming tool preferences in agentic llms

K. Faghih, W. Wang, Y . Cheng, S. Bharti, G. Sriramanan, S. Balasubramanian, P. Hosseini, and S. Feizi. Gaming tool preferences in agentic llms.arXiv preprint arXiv:2505.18135, 2025

work page arXiv 2025
[8]

Fatemi, M

B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, and B. Perozzi. Test of time: A benchmark for evaluating LLMs on temporal reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[9]

Y . Ge, S. Romeo, J. Cai, R. Shu, M. Sunkara, Y . Benajiba, and Y . Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues, 2025

work page 2025
[10]

Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

Google. Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

work page 2025
[11]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Gupta, P

V . Gupta, P. Kandoi, M. B. V ora, S. Zhang, Y . He, R. Reinanda, and V . Srikumar. Temptabqa: Temporal question answering for semi-structured tables, 2023

work page 2023
[13]

Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

Y . Huang, J. Shi, Y . Li, C. Fan, S. Wu, Q. Zhang, Y . Liu, P. Zhou, Y . Wan, N. Z. Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

work page arXiv 2023
[14]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li. Api-bank: A comprehensive benchmark for tool-augmented llms.arXiv preprint arXiv:2304.08244, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Z. Liu, P. Han, H. Yu, H. Li, and J. You. Time-r1: Towards comprehensive temporal reasoning in llms, 2025

work page 2025
[17]

The llama 3 herd of models, 2024

Llama Team, AI at Meta. The llama 3 herd of models, 2024. 10

work page 2024
[18]

Lucy and D

L. Lucy and D. Bamman. Gender and representation bias in GPT-3 generated stories. In N. Ak- oury, F. Brahman, S. Chaturvedi, E. Clark, M. Iyyer, and L. J. Martin, editors,Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual, June 2021. Association for Computational Linguistics

work page 2021
[19]

Augmented Language Models: a Survey

G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Un ministral, des ministraux: Introducing the world’s best edge models, Oct

Mistral AI Team. Un ministral, des ministraux: Introducing the world’s best edge models, Oct. 2024

work page 2024
[21]

s1: Simple test-time scaling

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Gpt-4 technical report, 2024

OpenAI. Gpt-4 technical report, 2024

work page 2024
[23]

Introducing OpenAI o3 and o4-mini, 2025

OpenAI. Introducing OpenAI o3 and o4-mini, 2025

work page 2025
[24]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[25]

Talm: Tool augmente d language models

A. Parisi, Y . Zhao, and N. Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[26]

S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning

work page
[27]

Y . Qin, K. Song, Y . Hu, W. Yao, S. Cho, X. Wang, X. Wu, F. Liu, P. Liu, and D. Yu. Infobench: Evaluating instruction following ability in large language models, 2024

work page 2024
[28]

Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

work page 2025
[29]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

work page 2024
[30]

H. Ross, A. S. Mahabaleshwarkar, and Y . Suhara. When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

work page arXiv 2025
[31]

J. B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools, 2023

work page 2023
[32]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023
[33]

T. Shen, R. Jin, Y . Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y . Liu, and D. Xiong. Large language model alignment: A survey.arXiv preprint arXiv:2309.15025, 2023

work page arXiv 2023
[34]

Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023

work page 2023
[35]

J. Shi, Z. Yuan, G. Tie, P. Zhou, N. Z. Gong, and L. Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Reflexion: Language Agents with Verbal Reinforcement Learning

N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.URL https://arxiv. org/abs/2303.11366, 1, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

X. Song, B. Liang, Y . Sun, C. Zhang, B. Wang, and R. Xu. Bridging time gaps: Temporal logic relations for enhancing temporal reasoning in large language models. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’25, page 3040–3044, New York, NY , USA, 2025. Association for Computing Machinery

work page 2025
[38]

Y . Song, W. Xiong, D. Zhu, W. Wu, H. Qian, M. Song, H. Huang, C. Li, K. Wang, R. Yao, et al. Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

work page arXiv 2023
[39]

Z. Su, J. Zhang, T. Zhu, X. Qu, J. Li, M. Zhang, and Y . Cheng. Timo: Towards better temporal reasoning for language models, 2024

work page 2024
[40]

S. M. T. I. Tonmoy, S. M. M. Zaman, V . Jain, A. Rani, V . Rawte, A. Chadha, and A. Das. A comprehensive survey of hallucination mitigation techniques in large language models, 2024

work page 2024
[41]

G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar. V oyager: An open-ended embodied agent with large language models.Transactions on Machine Learning Research

work page
[42]

W. Wang, P. Hosseini, and S. Feizi. Chain-of-defensive-thought: Structured reasoning elicits robustness in large language models against reference corruption.arXiv preprint arXiv:2504.20769, 2025

work page arXiv 2025
[43]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

work page
[44]

Wang and Y

Y . Wang and Y . Zhao. Tram: Benchmarking temporal reasoning for large language models, 2024

work page 2024
[45]

Y . Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu. Aligning large language models with human: A survey, 2023

work page 2023
[46]

Z. Wang, B. Bi, S. K. Pentyala, K. Ramnath, S. Chaudhuri, S. Mehrotra, Zixu, Zhu, X.-B. Mao, S. Asur, Na, and Cheng. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more, 2024

work page 2024
[47]

Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, and Y . Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents.arXiv preprint arXiv:2302.01560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[50]

Xiong, A

S. Xiong, A. Payani, R. Kompella, and F. Fekri. Large language models can learn temporal reasoning, 2024

work page 2024
[51]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

work page 2023
[53]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[54]

C. Yuan, Q. Xie, J. Huang, and S. Ananiadou. Back to the future: Towards explainable temporal reasoning with large language models, 2023. 12

work page 2023
[55]

Zhang, H

P. Zhang, H. Zhang, X. Wang, F. Zhang, and F. Yu. A brief survey on temporal reasoning based on large language models. In2024 8th Asian Conference on Artificial Intelligence Technology (ACAIT), pages 7–11, 2024

work page 2024
[56]

Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

Y . Zhang, J. Chen, J. Wang, Y . Liu, C. Yang, C. Shi, X. Zhu, Z. Lin, H. Wan, Y . Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

work page arXiv 2024
[57]

try again now

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou. Instruction- following evaluation for large language models, 2023. 13 AGPTPrompts Used for Trajectory Creation and Verification To generate trajectories, we first manually authored a single exemplar multi-turn trajectory and used it as an in-context example to prompt GPT-4o to p...

work page 2023

[1] [1]

A. Abid, M. Farooqi, and J. Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, page 298–306, New York, NY , USA, 2021. Association for Computing Machinery

work page 2021

[2] [2]

Introducing the model context protocol, November 2024

Anthropic. Introducing the model context protocol, November 2024

work page 2024

[3] [3]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan....

work page 2022

[4] [4]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries, 2024

work page 2024

[5] [5]

Z. Chu, J. Chen, Q. Chen, W. Yu, H. Wang, M. Liu, and B. Qin. Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models, 2024

work page 2024

[6] [6]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

work page 2024

[7] [7]

Gaming tool preferences in agentic llms

K. Faghih, W. Wang, Y . Cheng, S. Bharti, G. Sriramanan, S. Balasubramanian, P. Hosseini, and S. Feizi. Gaming tool preferences in agentic llms.arXiv preprint arXiv:2505.18135, 2025

work page arXiv 2025

[8] [8]

Fatemi, M

B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, and B. Perozzi. Test of time: A benchmark for evaluating LLMs on temporal reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[9] [9]

Y . Ge, S. Romeo, J. Cai, R. Shu, M. Sunkara, Y . Benajiba, and Y . Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues, 2025

work page 2025

[10] [10]

Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

Google. Agent2agent (a2a) protocol.https://google.github.io/A2A/, 2025

work page 2025

[11] [11]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Gupta, P

V . Gupta, P. Kandoi, M. B. V ora, S. Zhang, Y . He, R. Reinanda, and V . Srikumar. Temptabqa: Temporal question answering for semi-structured tables, 2023

work page 2023

[13] [13]

Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

Y . Huang, J. Shi, Y . Li, C. Fan, S. Wu, Q. Zhang, Y . Liu, P. Zhou, Y . Wan, N. Z. Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128, 2023

work page arXiv 2023

[14] [14]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

M. Li, Y . Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y . Li. Api-bank: A comprehensive benchmark for tool-augmented llms.arXiv preprint arXiv:2304.08244, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Z. Liu, P. Han, H. Yu, H. Li, and J. You. Time-r1: Towards comprehensive temporal reasoning in llms, 2025

work page 2025

[17] [17]

The llama 3 herd of models, 2024

Llama Team, AI at Meta. The llama 3 herd of models, 2024. 10

work page 2024

[18] [18]

Lucy and D

L. Lucy and D. Bamman. Gender and representation bias in GPT-3 generated stories. In N. Ak- oury, F. Brahman, S. Chaturvedi, E. Clark, M. Iyyer, and L. J. Martin, editors,Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual, June 2021. Association for Computational Linguistics

work page 2021

[19] [19]

Augmented Language Models: a Survey

G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al. Augmented language models: a survey.arXiv preprint arXiv:2302.07842, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Un ministral, des ministraux: Introducing the world’s best edge models, Oct

Mistral AI Team. Un ministral, des ministraux: Introducing the world’s best edge models, Oct. 2024

work page 2024

[21] [21]

s1: Simple test-time scaling

N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Gpt-4 technical report, 2024

OpenAI. Gpt-4 technical report, 2024

work page 2024

[23] [23]

Introducing OpenAI o3 and o4-mini, 2025

OpenAI. Introducing OpenAI o3 and o4-mini, 2025

work page 2025

[24] [24]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022

[25] [25]

Talm: Tool augmente d language models

A. Parisi, Y . Zhao, and N. Fiedel. Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022

[26] [26]

S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V . Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning

work page

[27] [27]

Y . Qin, K. Song, Y . Hu, W. Yao, S. Cho, X. Wang, X. Wu, F. Liu, P. Liu, and D. Yu. Infobench: Evaluating instruction following ability in large language models, 2024

work page 2024

[28] [28]

Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

work page 2025

[29] [29]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

work page 2024

[30] [30]

H. Ross, A. S. Mahabaleshwarkar, and Y . Suhara. When2call: When (not) to call tools.arXiv preprint arXiv:2504.18851, 2025

work page arXiv 2025

[31] [31]

J. B. Sandbrink. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools, 2023

work page 2023

[32] [32]

Schick, J

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023

[33] [33]

T. Shen, R. Jin, Y . Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y . Liu, and D. Xiong. Large language model alignment: A survey.arXiv preprint arXiv:2309.15025, 2023

work page arXiv 2023

[34] [34]

Y . Shen, K. Song, X. Tan, D. Li, W. Lu, and Y . Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023

work page 2023

[35] [35]

J. Shi, Z. Yuan, G. Tie, P. Zhou, N. Z. Gong, and L. Sun. Prompt injection attack to tool selection in llm agents.arXiv preprint arXiv:2504.19793, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Reflexion: Language Agents with Verbal Reinforcement Learning

N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.URL https://arxiv. org/abs/2303.11366, 1, 2023. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

X. Song, B. Liang, Y . Sun, C. Zhang, B. Wang, and R. Xu. Bridging time gaps: Temporal logic relations for enhancing temporal reasoning in large language models. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’25, page 3040–3044, New York, NY , USA, 2025. Association for Computing Machinery

work page 2025

[38] [38]

Y . Song, W. Xiong, D. Zhu, W. Wu, H. Qian, M. Song, H. Huang, C. Li, K. Wang, R. Yao, et al. Restgpt: Connecting large language models with real-world restful apis.arXiv preprint arXiv:2306.06624, 2023

work page arXiv 2023

[39] [39]

Z. Su, J. Zhang, T. Zhu, X. Qu, J. Li, M. Zhang, and Y . Cheng. Timo: Towards better temporal reasoning for language models, 2024

work page 2024

[40] [40]

S. M. T. I. Tonmoy, S. M. M. Zaman, V . Jain, A. Rani, V . Rawte, A. Chadha, and A. Das. A comprehensive survey of hallucination mitigation techniques in large language models, 2024

work page 2024

[41] [41]

G. Wang, Y . Xie, Y . Jiang, A. Mandlekar, C. Xiao, Y . Zhu, L. Fan, and A. Anandkumar. V oyager: An open-ended embodied agent with large language models.Transactions on Machine Learning Research

work page

[42] [42]

W. Wang, P. Hosseini, and S. Feizi. Chain-of-defensive-thought: Structured reasoning elicits robustness in large language models against reference corruption.arXiv preprint arXiv:2504.20769, 2025

work page arXiv 2025

[43] [43]

X. Wang, J. Wei, D. Schuurmans, Q. V . Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations

work page

[44] [44]

Wang and Y

Y . Wang and Y . Zhao. Tram: Benchmarking temporal reasoning for large language models, 2024

work page 2024

[45] [45]

Y . Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu. Aligning large language models with human: A survey, 2023

work page 2023

[46] [46]

Z. Wang, B. Bi, S. K. Pentyala, K. Ramnath, S. Chaudhuri, S. Mehrotra, Zixu, Zhu, X.-B. Mao, S. Asur, Na, and Cheng. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more, 2024

work page 2024

[47] [47]

Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, and Y . Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents.arXiv preprint arXiv:2302.01560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[49] [49]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022

[50] [50]

Xiong, A

S. Xiong, A. Payani, R. Kompella, and F. Fekri. Large language models can learn temporal reasoning, 2024

work page 2024

[51] [51]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023

work page 2023

[53] [53]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing rea- soning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023

[54] [54]

C. Yuan, Q. Xie, J. Huang, and S. Ananiadou. Back to the future: Towards explainable temporal reasoning with large language models, 2023. 12

work page 2023

[55] [55]

Zhang, H

P. Zhang, H. Zhang, X. Wang, F. Zhang, and F. Yu. A brief survey on temporal reasoning based on large language models. In2024 8th Asian Conference on Artificial Intelligence Technology (ACAIT), pages 7–11, 2024

work page 2024

[56] [56]

Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

Y . Zhang, J. Chen, J. Wang, Y . Liu, C. Yang, C. Shi, X. Zhu, Z. Lin, H. Wan, Y . Yang, et al. Toolbehonest: A multi-level hallucination diagnostic benchmark for tool-augmented large language models.arXiv preprint arXiv:2406.20015, 2024

work page arXiv 2024

[57] [57]

try again now

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y . Luan, D. Zhou, and L. Hou. Instruction- following evaluation for large language models, 2023. 13 AGPTPrompts Used for Trajectory Creation and Verification To generate trajectories, we first manually authored a single exemplar multi-turn trajectory and used it as an in-context example to prompt GPT-4o to p...

work page 2023