Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Catholijn M. Jonker; Lennard C. Froma; Maaike H.T. de Boer; Max J. van Duijn; Tom Kouwenhoven

arXiv: 2605.00964 · v1 · submitted 2026-05-01 · 💻 cs.IR · cs.AI · cs.HC

Pith reviewed 2026-05-09 18:29 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.HC

keywords RAG · human-AI collaboration · information seeking · model size · usability · LLM · multi-turn interaction · workplace AI

The pith

Humans collaborating with RAG-assistants achieve significantly better results than the models alone, regardless of model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests RAG-based assistants in a multi-turn information-seeking task modeled on workplace needs for compliance and data security. It compares human-assistant teams against the LLMs and LLM+RAG systems running alone. The key finding is that the human-AI teams perform significantly better than the AI-only baselines at all three tested model sizes (3B, 8B, and 70B parameters). Participants' ratings of usability and satisfaction vary little with model size. The work argues for assessing AI in real interactive human settings instead of relying solely on standard benchmarks.

Core claim

In a controlled experiment with 112 participants, human-AI collaboration with RAG-assistants produced a significant performance gain over model-only baselines in a multi-turn information-seeking scenario, and this gain held steady whether the underlying model had 3B, 8B, or 70B parameters. Perceived usability and satisfaction showed little difference across model sizes.

What carries the argument

The multi-turn human-AI collaborative dynamic with RAG-assistants, where humans supply context and oversight during information retrieval and response generation.

If this is right

Hybrid human-AI systems deliver performance gains in information-seeking tasks that remain consistent across model scales from 3B to 70B.
User ratings of usability and satisfaction do not increase meaningfully with larger models during collaborative use.
Evaluations that include actual multi-turn human interactions capture benefits that isolated model benchmarks miss.
RAG-assistants paired with humans can support tasks involving compliance and secure data handling without requiring the largest models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Smaller models could be paired with humans to achieve most of the performance benefit at lower computational cost.
The pattern may appear in other interactive AI tasks where human input compensates for model shortcomings.
Direct tests with real sensitive data and legal constraints would be required to verify applicability beyond the lab setting.

Load-bearing premise

The multi-turn information-seeking scenario used in the experiment is representative of real workplace settings that require compliance with local legislation and secure handling of sensitive data.

What would settle it

A follow-up study using actual sensitive compliance queries in which human-AI teams show no performance advantage over model-only systems would falsify the claim of significant collaborative gains.

Figures

Figures reproduced from arXiv: 2605.00964 by Catholijn M. Jonker, Lennard C. Froma, Maaike H.T. de Boer, Max J. van Duijn, Tom Kouwenhoven.

**Figure 1.** An overview of the RAG pipeline. … answering (QA) and information seeking in realistic workplace scenarios. Professionals are often tasked with analysing large volumes of documentation. While LLMs hold potential to offer support in this process, their deployment is constrained in cases where data is sensitive and compliance with, for example, the EU AI Act is required. To address such constraints, open, tr…

**Figure 2.** Performance as the accuracy of answers to the 9 test questions across conditions, compared to base-strategy (LLM+RAG) and knowledge base only setting (LLM-only). 4. Results: Comparing the question’s evaluation from both independent raters yields a Cohen’s Kappa of .898. This high Cohen’s Kappa score shows little difference between raters’ judgments, which is unsurprising given the unambiguous answers to the…
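The inter-rater agreement quoted in the Figure 2 excerpt is a Cohen's Kappa. As a rough illustration of how such a score is computed, here is a minimal Python sketch. The function name `cohens_kappa` and the two rating lists are made up for illustration; they are not the paper's data or code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings of ten answers (1 = correct, 0 = incorrect).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when two raters score the same answers.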

Original abstract

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Human-RAG teams beat model-only baselines in this multi-turn task with little effect from scaling 3B to 70B, but the setup leaves real compliance constraints untested.

The core result here is straightforward: adding a human user to a RAG assistant produced a measurable performance gain over the LLM or LLM+RAG baselines, and that gain held across the three model sizes tested. User ratings for usability and satisfaction stayed roughly flat no matter whether the underlying model was 3B, 8B, or 70B. That second part is the more interesting observation for anyone who has to pick a model for a deployed assistant.

Referee Report

3 major / 2 minor

Summary. The paper reports results from a user study (N=112) evaluating RAG-based chatbot assistants in a multi-turn information-seeking task inspired by workplace compliance and sensitive-data scenarios. It compares human-AI collaboration performance and perceptions against LLM-only and LLM+RAG baselines across three model sizes (3B, 8B, 70B), claiming significant performance gains from human assistance that hold irrespective of model size, with little variation in perceived usability or satisfaction across sizes.

Significance. If the results hold, the work provides useful empirical evidence that hybrid human-AI systems can deliver performance benefits in multi-turn information-seeking tasks beyond what standalone LLMs achieve, even with smaller models. The inclusion of both objective performance metrics and subjective usability/satisfaction measures in a realistic collaborative setting, rather than benchmark-only evaluation, is a strength that could inform design of efficient hybrid assistants.

major comments (3)

[§3] §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.
[§4] §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.
[§4.2] §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.

minor comments (2)

[Abstract] Abstract: The total N=112 is reported but not broken down by experimental condition or model size, which would improve clarity for interpreting the 'irrespective of model size' result.
[Figures] Figure captions and legends (throughout): Some figures comparing performance across model sizes would benefit from explicit indication of statistical significance markers and sample sizes per bar.
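The reporting requested in the §4 comment (per-condition sample sizes, confidence intervals, effect sizes) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis: the scores are invented placeholders (number of the 9 test questions answered correctly per participant), the helper names `summarize` and `cohens_d` are hypothetical, and the interval uses a normal approximation that would be too narrow for groups this small.

```python
import math
import statistics


def summarize(scores, z=1.96):
    """Return n, mean, and a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return {"n": n, "mean": mean, "ci": (mean - z * sem, mean + z * sem)}


def cohens_d(group_a, group_b):
    """Cohen's d (standardized mean difference) with a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Placeholder data only: correct answers out of 9 per participant, by model size.
scores_by_size = {
    "3B": [6, 7, 8, 7],
    "8B": [7, 8, 7, 8],
    "70B": [8, 7, 8, 9],
}

for size, scores in scores_by_size.items():
    s = summarize(scores)
    low, high = s["ci"]
    print(f"{size}: n={s['n']} mean={s['mean']:.2f} 95% CI=({low:.2f}, {high:.2f})")

d = cohens_d(scores_by_size["8B"], scores_by_size["3B"])
print(f"Cohen's d (8B vs 3B) = {d:.2f}")
```

With real data, a t-based interval or a bootstrap would likely be a safer choice than the fixed `z=1.96`, and the omnibus test across the three sizes would be reported alongside these descriptives.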

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major comments below, and we will incorporate revisions as indicated.

Referee: §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.

Authors: We agree that more detail on the operationalization of the task constraints is necessary. In the revised manuscript, we will expand the Methodology section to specify how the compliance and sensitive data handling aspects were implemented, including the provision of expert-validated ground truth for accuracy assessment and the specific rules participants and the AI assistant were instructed to adhere to during the interactions. revision: yes

Referee: §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.

Authors: We apologize for the insufficient statistical reporting in the current version. We will update the Results section to include full details of the statistical tests conducted, including the use of ANOVA to assess effects across model sizes, t-tests for pairwise comparisons, reported p-values, effect sizes, per-condition sample sizes (with total N=112), and the addition of error bars or confidence intervals to figures. revision: yes

Referee: §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.

Authors: We will revise §4.2 to explicitly clarify the experimental conditions. The human-AI collaboration condition incorporates RAG, the LLM-only baseline does not include RAG, and there is a separate LLM+RAG baseline without human involvement. This design allows us to demonstrate gains from human collaboration over both pure LLM and LLM+RAG conditions, thereby isolating the contribution of the human-AI interaction dynamic. revision: yes
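The three conditions described in this response can be laid out as a small design grid, which makes it easier to see which comparison isolates which factor. The sketch below is an illustration only: the assumption that every condition is crossed with every model size, the names `CONDITIONS`, `design_cells` and `attributable_gains`, and the accuracy values are all hypothetical and not taken from the paper.

```python
from itertools import product

MODEL_SIZES = ("3B", "8B", "70B")

# Which components are present in each condition, per the simulated rebuttal.
CONDITIONS = {
    "LLM-only": {"rag": False, "human": False},
    "LLM+RAG": {"rag": True, "human": False},
    "Human+RAG-assistant": {"rag": True, "human": True},
}


def design_cells():
    """Enumerate every (model size, condition) cell of the assumed crossed design."""
    return list(product(MODEL_SIZES, CONDITIONS))


def attributable_gains(accuracy):
    """Split the gain for one model size into a retrieval part and a human part."""
    retrieval_gain = accuracy["LLM+RAG"] - accuracy["LLM-only"]
    human_gain = accuracy["Human+RAG-assistant"] - accuracy["LLM+RAG"]
    return {"retrieval": retrieval_gain, "human": human_gain}


# Placeholder accuracies for a single model size (invented, not reported results).
example_accuracy = {"LLM-only": 0.25, "LLM+RAG": 0.5, "Human+RAG-assistant": 0.75}
gains = attributable_gains(example_accuracy)
print(len(design_cells()), "cells;", gains)
```

Comparing the human-assisted condition against LLM+RAG, instead of against LLM-only, is what keeps retrieval from being counted as part of the collaboration effect.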

Circularity Check

0 steps flagged

No significant circularity in empirical user study

This paper reports results from a controlled user study (N=112) comparing human-AI collaboration against LLM-only and LLM+RAG baselines in a multi-turn information-seeking task. The headline finding—that human-AI performance gains are significant irrespective of model size (3B/8B/70B)—is obtained directly from measured accuracy, usability, and satisfaction data collected in the experiment. No mathematical derivations, equations, fitted parameters, or predictive models are present that could reduce any claimed result to its inputs by construction. Self-citations, if any, support background or related work but do not carry the load-bearing argument; the evidence is primary experimental data rather than a self-referential chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurements from a controlled user study rather than theoretical axioms or derivations; standard assumptions of statistical significance testing are invoked implicitly.

axioms (1)

[standard math] Standard assumptions underlying statistical significance tests for performance differences in user studies
The abstract states that performance gains are significant without detailing the tests or corrections applied.


Kruskal WH, Wallis W A. Use of ranks in one-criterion variance analysis. Journal of the American statistical Association. 1952;47(260):583-621

work page 1952
[64]

Multiple comparisons using rank sums

Dunn OJ. Multiple comparisons using rank sums. Technometrics. 1964;6(3):241- 52

work page 1964
[65]

Bonferroni and ˇSid´ak corrections for multiple comparisons

Abdi H, et al. Bonferroni and ˇSid´ak corrections for multiple comparisons. Ency- clopedia of measurement and statistics. 2007;3(01):2007

work page 2007
[66]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Suzgun M, Scales N, Sch¨arli N, Gehrmann S, Tay Y , Chung HW, et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. In: Rogers A, Boyd-Graber J, Okazaki N, editors. Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguis- tics; 2023. p. 13003-51. Available from:h...

work page 2023
[67]

Instruction-following evaluation for large language models

Zhou J, Lu T, Mishra S, Brahma S, Basu S, Luan Y , et al. Instruction-following evaluation for large language models. arXiv preprint arXiv:231107911. 2023. May 5, 2026

work page 2023
[68]

Who should i trust: Ai or myself? leveraging human and ai correctness likelihood to promote appropriate trust in ai-assisted decision-making

Ma S, Lei Y , Wang X, Zheng C, Shi C, Yin M, et al. Who should i trust: Ai or myself? leveraging human and ai correctness likelihood to promote appropriate trust in ai-assisted decision-making. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems; 2023. p. 1-19

work page 2023
[69]

HyEnA: A Hybrid Method for Extracting Arguments from Opinions

Van Der Meer M, Liscio E, Jonker CM, Plaat A, V ossen P, Murukannaiah PK. HyEnA: A Hybrid Method for Extracting Arguments from Opinions. In: Schlobach S, Perez-Ortiz M, Tielman M, editors. HHAI2022: Augmenting Human Intellect – Proceedings of the 1st International Conference on Hybrid Human-Artificial Intel- ligence. vol. 354 of Frontiers in Artificial In...

work page
[70]

Query2doc: Query Expansion with Large Language Mod- els

Wang L, Yang N, Wei F. Query2doc: Query Expansion with Large Language Mod- els. In: Bouamor H, Pino J, Bali K, editors. Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Processing. Singapore: Associ- ation for Computational Linguistics; 2023. p. 9414-23. Available from:https: //aclanthology.org/2023.emnlp-main.585/

work page 2023
