pith. machine review for the scientific record.

arxiv: 2601.17617 · v3 · submitted 2026-01-24 · 💻 cs.IR · cs.CL

Recognition: 1 theorem link · Lean Theorem

Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:17 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords agentic search · search session analysis · query reformulation · intent detection · context-driven term adoption · multi-turn information seeking · LLM-powered agents · log mining

The pith

Analysis of 14.44 million agentic search requests (3.97 million sessions) shows that, on average, 54% of newly introduced query terms come from previously retrieved evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This study examines millions of real multi-step search interactions by LLM agents to map how sessions unfold and how retrieved results shape later queries. It groups the requests into sessions, labels their overall goals and individual term changes, then tracks whether fresh words in each query already appeared in earlier results. The work finds that most sessions stay short and rapid while the balance between repetition and new exploration shifts depending on the goal, such as fact-finding versus reasoning. These observations supply direct signals for when agents should halt or how they should preserve useful context across turns.
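The sessionization step is not spelled out in the text excerpted here; a minimal sketch, assuming sessions are cut per client on an inactivity gap (the field names and the 30-minute threshold below are illustrative, not the paper's actual rule):

    from collections import defaultdict

    def sessionize(requests, gap_seconds=30 * 60):
        # Group requests into per-client sessions, splitting whenever the
        # time between consecutive requests exceeds the inactivity gap.
        # 'client_id' and 'timestamp' are hypothetical field names.
        by_client = defaultdict(list)
        for r in requests:
            by_client[r["client_id"]].append(r)
        sessions = []
        for reqs in by_client.values():
            reqs.sort(key=lambda r: r["timestamp"])
            current = [reqs[0]]
            for prev, cur in zip(reqs, reqs[1:]):
                if cur["timestamp"] - prev["timestamp"] > gap_seconds:
                    sessions.append(current)
                    current = []
                current.append(cur)
            sessions.append(current)
        return sessions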

Core claim

The paper establishes that query reformulations in agentic search are substantially driven by retrieved evidence, with 54% of newly introduced terms, on average, appearing in the evidence context accumulated from previous steps. It further finds that fact-seeking sessions show rising repetition over time while reasoning sessions sustain wider exploration, and that over 90% of multi-turn sessions have at most ten steps, with 89% of inter-step intervals under one minute. These patterns are derived by sessionizing the logs and applying LLM-based annotations to quantify term adoption and intent variation.

What carries the argument

Context-driven Term Adoption Rate (CTAR), the metric that measures whether newly introduced query terms are lexically traceable to the evidence accumulated across earlier retrieval steps.
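Written out, the metric reduces to a simple fraction per reformulation step. This is a reconstruction from the textual definition; whether the reported 54% is averaged over steps or over sessions is not stated in the excerpted text.

    % N_t: terms introduced at step t that appear in no earlier query of the
    % session; E_{<t}: terms occurring in evidence retrieved before step t.
    N_t = \mathrm{terms}(q_t) \setminus \bigcup_{t' < t} \mathrm{terms}(q_{t'}),
    \qquad
    \mathrm{CTAR} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}}
      \frac{\lvert N_t \cap E_{<t} \rvert}{\lvert N_t \rvert},
    \qquad
    \mathcal{T} = \{\, t : N_t \neq \varnothing \,\}.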

If this is right

  • Repetition patterns can serve as signals for developing stopping criteria tailored to different session intents (a minimal sketch follows this list).
  • Retrieval strategies can be made adaptive by allocating resources according to whether a session is fact-seeking or requires reasoning.
  • Maintaining and tracking evidence context across steps enables better support for query reformulation in agents.
  • Insights from session dynamics support the creation of repetition-aware and intent-adaptive search mechanisms.
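As a sketch of the first bullet only: a repetition-aware stopping rule keyed to session intent. The overlap measure, threshold, window, and intent adjustment are illustrative assumptions, not mechanisms the paper proposes.

    def should_stop(queries, intent, threshold=0.6, window=3):
        # Stop once the last `window` queries mostly repeat one another,
        # measured by Jaccard overlap of their term sets. Fact-seeking
        # sessions repeat earlier in the paper's analysis, so the threshold
        # is loosened for them; all numbers here are illustrative.
        if len(queries) < window:
            return False
        if intent == "fact_seeking":
            threshold -= 0.1
        recent = [set(q.lower().split()) for q in queries[-window:]]
        overlaps = [len(a & b) / len(a | b)
                    for a, b in zip(recent, recent[1:]) if a | b]
        return bool(overlaps) and min(overlaps) >= threshold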

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • These findings imply that evaluation of search agents should incorporate metrics for context utilization and term traceability rather than relying solely on end-task success.
  • Agent designs could benefit from mechanisms that explicitly surface or highlight evidence sources to improve reformulation quality.
  • Extending the analysis to include the impact of different retrieval qualities on these patterns could reveal additional levers for optimization.

Load-bearing premise

The LLM-based annotation process for determining session intents and step-wise query reformulation labels accurately reflects the actual behaviors without substantial misclassification.

What would settle it

A controlled experiment where human annotators independently label a sample of sessions for intents and term origins, then compare agreement rates with the automated labels, would confirm or refute the reliability of the discovered patterns.
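Agreement in such a study would typically be summarized with Cohen's kappa. A minimal sketch, assuming the human and LLM intent labels arrive as parallel lists of strings:

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Chance-corrected agreement between two annotators on the same items.
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(
            (freq_a[c] / n) * (freq_b[c] / n)
            for c in set(freq_a) | set(freq_b)
        )
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

If an extra dependency is acceptable, sklearn.metrics.cohen_kappa_score computes the same quantity.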

Figures

Figures reproduced from arXiv: 2601.17617 by Bruno Martins, Chenyan Xiong, Jamie Callan, Jingjie Ning, João Coelho, João Magalhães, Yibo Kong, Yunfan Long.

Figure 1: Intent–trajectory structure of agentic search logs.
Figure 2: Representativeness and diversity of the DRGym logs.
Figure 3: Left: distribution of session length (number of …
Figure 4: Step-wise trajectory distribution trends for the first 10 steps across different task intents. Each sub-figure illustrates …
Figure 5: A Declarative retry-loop example dominated by …
Figure 7: A reset-then-refine example: Specialization …
Original abstract

LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understanding of how agentic search sessions unfold and how retrieved evidence is reflected in later queries. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, i.e., an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are lexically traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90% of multi-turn sessions contain at most ten steps, and 89% of inter-step intervals fall under one minute. Second, behavior varies by intent. Fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, query reformulations are often traceable to retrieved evidence across steps. On average, 54% of newly introduced query terms appear in the accumulated evidence context, with additional traceability to earlier steps beyond the most recent retrieval. These findings provide candidate signals for repetition-aware stopping, intent-adaptive retrieval budgeting, and explicit cross-step context tracking. We released the anonymized logs, making them available at a public Hugging Face repository: https://huggingface.co/datasets/cx-cmu/deepresearchgym-agentic-search-logs
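For readers who want to verify the numbers themselves, the released logs can be pulled with the Hugging Face datasets library. The dataset identifier is taken verbatim from the abstract; its splits, configurations, and column names are not described there, so the sketch below only inspects the schema rather than assuming one.

    from datasets import load_dataset

    # Dataset ID copied from the abstract; assumes a default configuration.
    logs = load_dataset("cx-cmu/deepresearchgym-agentic-search-logs")
    print(logs)                       # available splits and row counts
    split = next(iter(logs))
    print(logs[split].features)       # column names and types
    print(logs[split][0])             # one raw log record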

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes 14.44M search requests across 3.97M sessions from the DeepResearchGym open API logs. It applies LLM-based annotation to label session-level intents and step-wise query reformulations, introduces the Context-driven Term Adoption Rate (CTAR) metric to measure lexical traceability of new query terms to accumulated retrieved evidence, and reports empirical patterns including short session lengths (over 90% of multi-turn sessions have at most 10 steps), sub-minute inter-step intervals (89%), intent-dependent behaviors (high repetition in fact-seeking vs. broader exploration in reasoning sessions), and an average 54% traceability rate for newly introduced terms, with some traceability extending beyond the most recent retrieval. The anonymized logs are released publicly.

Significance. If the LLM annotations prove reliable, the work delivers a valuable large-scale observational characterization of real-world agentic search trajectories that is rare in the IR literature. The public dataset release and the proposed signals for repetition-aware stopping and intent-adaptive retrieval budgeting are concrete contributions that can directly inform agent design and evaluation.

major comments (2)
  1. [§3 (Annotation and Labeling Pipeline)] The central 54% CTAR figure is computed from LLM-assigned reformulation labels that identify 'new' terms and define the evidence context, yet the manuscript reports no inter-annotator agreement, human validation subset, or sensitivity analysis on these labels. Any systematic bias in the LLM annotator directly affects both numerator and denominator of CTAR, rendering the quantitative claim unverifiable from the provided text.
  2. [§4.3 (CTAR Definition and Computation)] The precise operational definition of 'accumulated evidence context' (whether it includes only retrieved passages, prior queries, or both) and the exact procedure for determining term novelty across steps are not fully specified. Without this, it is impossible to reproduce or assess the reported 54% average and the claim of traceability to earlier steps.
minor comments (1)
  1. [Abstract] The abstract states that logs are released at a Hugging Face repository but does not include the exact dataset identifier or citation format, which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major point below and commit to revisions that improve the clarity and verifiability of the work without altering its core claims.

Point-by-point responses
  1. Referee: §3 (Annotation and Labeling Pipeline): The central 54% CTAR figure is computed from LLM-assigned reformulation labels that identify 'new' terms and define the evidence context, yet the manuscript reports no inter-annotator agreement, human validation subset, or sensitivity analysis on these labels. Any systematic bias in the LLM annotator directly affects both numerator and denominator of CTAR, rendering the quantitative claim unverifiable from the provided text.

    Authors: We agree that the absence of reported validation metrics for the LLM annotations is a limitation that affects confidence in the CTAR results. In the revised version we will add a human validation study on a stratified random sample of 1,000 sessions (balanced across intents), with two independent human annotators. We will report Cohen's kappa for inter-annotator agreement between humans and between humans and the LLM, plus a sensitivity analysis that varies the LLM prompt template and model temperature. These additions will directly support the reliability of the 54% figure. revision: yes

  2. Referee: §4.3 (CTAR Definition and Computation): The precise operational definition of 'accumulated evidence context' (whether it includes only retrieved passages, prior queries, or both) and the exact procedure for determining term novelty across steps are not fully specified. Without this, it is impossible to reproduce or assess the reported 54% average and the claim of traceability to earlier steps.

    Authors: We acknowledge that the current description in §4.3 leaves room for ambiguity. The accumulated evidence context is strictly the concatenation of all retrieved passages (not prior queries) up to and including the current step; term novelty is operationalized by tokenizing queries with the same tokenizer used for passage indexing and checking whether a token appears in the current query but not in any prior query. Traceability is measured by exact lexical match (case-insensitive) within the evidence context. In the revision we will insert explicit pseudocode, a formal definition of the context window, and an example walkthrough of a three-step session to make the 54% computation fully reproducible. revision: yes
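Pending the pseudocode promised above, a minimal sketch of the computation as the response describes it: novelty is judged against all prior queries, traceability by case-insensitive exact match against evidence accumulated from earlier retrievals, and a plain whitespace tokenizer stands in for the paper's actual tokenizer. How per-step rates are aggregated into the reported 54% is an assumption here.

    def session_ctar(queries, evidence_per_step,
                     tokenize=lambda s: s.lower().split()):
        # queries[t] is the query issued at step t; evidence_per_step[t] is
        # the text retrieved at step t. Evidence accumulates from earlier
        # steps only; the tokenizer is a stand-in for the paper's.
        seen_query_terms, evidence_terms = set(), set()
        step_rates = []
        for t, query in enumerate(queries):
            terms = set(tokenize(query))
            new_terms = terms - seen_query_terms
            if t > 0 and new_terms:  # step 0 has no prior evidence
                step_rates.append(len(new_terms & evidence_terms) / len(new_terms))
            seen_query_terms |= terms
            evidence_terms |= set(tokenize(evidence_per_step[t]))
        return sum(step_rates) / len(step_rates) if step_rates else None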

Circularity Check

0 steps flagged

No circularity: purely observational log analysis with defined metric

full rationale

The paper conducts empirical analysis on 14.44M search requests by sessionizing logs, applying LLM annotation for intents and reformulation labels, defining the CTAR metric as the proportion of new query terms traceable to accumulated evidence, and reporting observed statistics (e.g., 54% average). No derivations, equations, fitted parameters presented as predictions, or self-citation chains reduce any reported result to the inputs by construction. The 54% figure is a direct computation on the annotated data using the explicitly defined metric, not a tautology or forced outcome. This is a standard empirical study with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The analysis rests on the assumption that LLM annotations faithfully capture intent and reformulation types and that the DeepResearchGym logs are representative of broader agentic search behavior.

axioms (1)
  • domain assumption LLM-based annotation reliably identifies session-level intents and step-wise query-reformulation labels
    Used to label all 3.97M sessions and individual steps; no validation metrics provided in abstract.
invented entities (1)
  • Context-driven Term Adoption Rate (CTAR) no independent evidence
    purpose: Quantify the fraction of newly introduced query terms that are lexically traceable to previously retrieved evidence
    Newly defined metric in this paper.

pith-pipeline@v0.9.0 · 5615 in / 1321 out tokens · 40424 ms · 2026-05-16T11:17:44.010558+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 13 internal anchors
