
arxiv: 2605.00964 · v1 · submitted 2026-05-01 · 💻 cs.IR · cs.AI · cs.HC

Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Pith reviewed 2026-05-09 18:29 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.HC
keywords RAG · human-AI collaboration · information seeking · model size · usability · LLM · multi-turn interaction · workplace AI

The pith

Humans collaborating with RAG-assistants achieve significantly better results than the models alone, regardless of model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests RAG-based assistants in a multi-turn information-seeking task modeled on workplace needs for compliance and data security. It compares the performance of humans teamed with the assistants against that of the LLMs or RAG systems alone. The key finding is that the human-AI teams perform significantly better than the AI-only baselines across all tested model sizes (3B, 8B, and 70B parameters). Participants' ratings of usability and satisfaction vary little with model size. The work argues for assessing AI in real interactive human settings instead of relying solely on standard benchmarks.

Core claim

In a controlled experiment with 112 participants, human-AI collaboration with RAG-assistants produced a significant performance gain over model-only baselines in a multi-turn information-seeking scenario, and this gain held steady whether the underlying model had 3B, 8B, or 70B parameters. Perceived usability and satisfaction showed little difference across model sizes.

What carries the argument

The multi-turn human-AI collaborative dynamic with RAG-assistants, where humans supply context and oversight during information retrieval and response generation.
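
To make the multi-turn dynamic concrete, the following is a minimal sketch of a retrieve-then-answer loop in which each human follow-up is folded into the next retrieval query. It is not the paper's pipeline: the three-passage knowledge base, the word-overlap retriever (a stand-in for embedding search), and the stub `answer` function (a stand-in for an LLM call) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical knowledge base; the study's actual documents are not given here.
KNOWLEDGE_BASE = [
    "Personal data must be stored on encrypted drives.",
    "Access requests are answered within thirty days.",
    "Visitors must sign in at the front desk.",
]

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split()]

def retrieve(query, k=1):
    """Rank passages by word overlap with the query (assumed toy retriever)."""
    q = Counter(tokenize(query))

    def overlap(passage):
        return sum((q & Counter(tokenize(passage))).values())

    return sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)[:k]

def answer(query, history):
    """Stub generator: a real assistant would prompt an LLM with the passages."""
    # Earlier user turns are appended so that follow-up questions keep context.
    contextual_query = " ".join(history + [query])
    passages = retrieve(contextual_query)
    return "Based on the documents: " + passages[0]

history = []
replies = []
for turn in ["How should personal data be stored?", "And on which drives?"]:
    replies.append(answer(turn, history))
    history.append(turn)
```

The second turn is underspecified on its own; it only retrieves the right passage because the human's earlier turn is carried along, which is one simple way human input can supply context to the system.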

If this is right

  • Hybrid human-AI systems deliver performance gains in information-seeking tasks that remain consistent across model scales from 3B to 70B.
  • User ratings of usability and satisfaction do not increase meaningfully with larger models during collaborative use.
  • Evaluations that include actual multi-turn human interactions capture benefits that isolated model benchmarks miss.
  • RAG-assistants paired with humans can support tasks involving compliance and secure data handling without requiring the largest models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller models could be paired with humans to achieve most of the performance benefit at lower computational cost.
  • The pattern may appear in other interactive AI tasks where human input compensates for model shortcomings.
  • Direct tests with real sensitive data and legal constraints would be required to verify applicability beyond the lab setting.

Load-bearing premise

The multi-turn information-seeking scenario used in the experiment is representative of real workplace settings that require compliance with local legislation and secure handling of sensitive data.

What would settle it

A follow-up study using actual sensitive compliance queries in which human-AI teams show no performance advantage over model-only systems would falsify the claim of significant collaborative gains.

Figures

Figures reproduced from arXiv: 2605.00964 by Catholijn M. Jonker, Lennard C. Froma, Maaike H.T. de Boer, Max J. van Duijn, Tom Kouwenhoven.

Figure 1: An overview of the RAG pipeline.
… answering (QA) and information seeking in realistic workplace scenarios. Professionals are often tasked with analysing large volumes of documentation. While LLMs hold potential to offer support in this process, their deployment is constrained in cases where data is sensitive and compliance with, for example, the EU AI Act is required. To address such constraints, open, tr…
Figure 2: Performance as the accuracy of answers to the 9 test questions across conditions, compared to base-strategy (LLM+RAG) and knowledge base only setting (LLM-only).
4. Results: Comparing the question’s evaluation from both independent raters yields a Cohen’s Kappa of .898. This high Cohen’s Kappa score shows little difference between raters’ judgments, which is unsurprising given the unambiguous answers to the…
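
For readers unfamiliar with the inter-rater statistic quoted in the Figure 2 excerpt, Cohen's kappa is the observed agreement corrected for the agreement expected by chance, (p_o - p_e) / (1 - p_e). The sketch below computes it from scratch; the two rating lists are invented for illustration and are not the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical correct (1) / incorrect (0) judgements for ten answers.
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(rater_a, rater_b)
```

In this toy case the raters disagree on one of ten items, so observed agreement is 0.9 while chance agreement is 0.5; kappa reports how much of the remaining room above chance the raters actually covered.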
Original abstract

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports results from a user study (N=112) evaluating RAG-based chatbot assistants in a multi-turn information-seeking task inspired by workplace compliance and sensitive-data scenarios. It compares human-AI collaboration performance and perceptions against LLM-only and LLM+RAG baselines across three model sizes (3B, 8B, 70B), claiming significant performance gains from human assistance that hold irrespective of model size, with little variation in perceived usability or satisfaction across sizes.

Significance. If the results hold, the work provides useful empirical evidence that hybrid human-AI systems can deliver performance benefits in multi-turn information-seeking tasks beyond what standalone LLMs achieve, even with smaller models. The inclusion of both objective performance metrics and subjective usability/satisfaction measures in a realistic collaborative setting, rather than benchmark-only evaluation, is a strength that could inform design of efficient hybrid assistants.

major comments (3)
  1. [§3] §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.
  2. [§4] §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.
  3. [§4.2] §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.
minor comments (2)
  1. [Abstract] Abstract: The total N=112 is reported but not broken down by experimental condition or model size, which would improve clarity for interpreting the 'irrespective of model size' result.
  2. [Figures] Figure captions and legends (throughout): Some figures comparing performance across model sizes would benefit from explicit indication of statistical significance markers and sample sizes per bar.
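
As an illustration of the reporting that major comment 2 asks for, the sketch below computes a one-way ANOVA F statistic and the eta-squared effect size across three model-size groups. It is an assumption-laden example, not the paper's analysis: the scores are invented, the choice of ANOVA simply follows the referee's own example, and the p-value is omitted because it requires the F distribution from a statistics library.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta_squared, df_between, df_within) for a list of samples.

    Assumes at least two groups and some within-group variance.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared, df_between, df_within

# Hypothetical per-participant scores (questions correct out of 9).
scores = {
    "3B": [5, 6, 7, 6],
    "8B": [6, 7, 8, 7],
    "70B": [7, 7, 8, 6],
}
f_stat, eta_sq, df_b, df_w = one_way_anova(list(scores.values()))
```

Reporting the degrees of freedom alongside F makes the per-condition sample sizes recoverable, and eta-squared indicates how much of the score variance is associated with model size, which is the quantity behind an "irrespective of model size" claim.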

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major comments below, and we will incorporate revisions as indicated.

  1. Referee: §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.

    Authors: We agree that more detail on the operationalization of the task constraints is necessary. In the revised manuscript, we will expand the Methodology section to specify how the compliance and sensitive data handling aspects were implemented, including the provision of expert-validated ground truth for accuracy assessment and the specific rules participants and the AI assistant were instructed to adhere to during the interactions. revision: yes

  2. Referee: §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.

    Authors: We apologize for the insufficient statistical reporting in the current version. We will update the Results section to include full details of the statistical tests conducted, including the use of ANOVA to assess effects across model sizes, t-tests for pairwise comparisons, reported p-values, effect sizes, per-condition sample sizes (with total N=112), and the addition of error bars or confidence intervals to figures. revision: yes

  3. Referee: §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.

    Authors: We will revise §4.2 to explicitly clarify the experimental conditions. The human-AI collaboration condition incorporates RAG, the LLM-only baseline does not include RAG, and there is a separate LLM+RAG baseline without human involvement. This design allows us to demonstrate gains from human collaboration over both pure LLM and LLM+RAG conditions, thereby isolating the contribution of the human-AI interaction dynamic. revision: yes
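
The three-condition design described in this response can be read as a simple decomposition of accuracy gains. The sketch below works through the arithmetic with invented counts out of the 9 test questions; the numbers are hypothetical, and treating the human condition as a single count (instead of an average over participants) is a simplifying assumption.

```python
from fractions import Fraction

# Hypothetical counts of correct answers out of 9 test questions;
# these are NOT the paper's results.
TOTAL_QUESTIONS = 9
correct = {
    "LLM-only": 3,
    "LLM+RAG": 5,
    "Human+RAG": 8,
}

accuracy = {cond: Fraction(n, TOTAL_QUESTIONS) for cond, n in correct.items()}

# Gain attributable to retrieval alone (no human in the loop).
retrieval_gain = accuracy["LLM+RAG"] - accuracy["LLM-only"]
# Gain attributable to adding the human to the same RAG system.
collaboration_gain = accuracy["Human+RAG"] - accuracy["LLM+RAG"]
# Comparing the human condition with LLM-only mixes both effects.
confounded_gain = accuracy["Human+RAG"] - accuracy["LLM-only"]

# The mixed comparison is exactly the sum of the two separate effects.
assert confounded_gain == retrieval_gain + collaboration_gain
```

Only the Human+RAG versus LLM+RAG comparison holds retrieval constant, which is why the referee's confound concern applies to the LLM-only comparison but not to the LLM+RAG one.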

Circularity Check

0 steps flagged

No significant circularity in empirical user study


This paper reports results from a controlled user study (N=112) comparing human-AI collaboration against LLM-only and LLM+RAG baselines in a multi-turn information-seeking task. The headline finding—that human-AI performance gains are significant irrespective of model size (3B/8B/70B)—is obtained directly from measured accuracy, usability, and satisfaction data collected in the experiment. No mathematical derivations, equations, fitted parameters, or predictive models are present that could reduce any claimed result to its inputs by construction. Self-citations, if any, support background or related work but do not carry the load-bearing argument; the evidence is primary experimental data rather than a self-referential chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurements from a controlled user study rather than theoretical axioms or derivations; standard assumptions of statistical significance testing are invoked implicitly.

axioms (1)
  • [standard math] Standard assumptions underlying statistical significance tests for performance differences in user studies
    The abstract states that performance gains are significant without detailing the tests or corrections applied.


