Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?
Pith reviewed 2026-05-09 18:29 UTC · model grok-4.3
The pith
Humans collaborating with RAG-assistants achieve significantly better results than the models alone, regardless of model size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a controlled experiment with 112 participants, human-AI collaboration with RAG-assistants produced a significant performance gain over model-only baselines in a multi-turn information-seeking scenario, and this gain held steady whether the underlying model had 3B, 8B, or 70B parameters. Perceived usability and satisfaction showed little difference across model sizes.
What carries the argument
The multi-turn human-AI collaborative dynamic with RAG-assistants, where humans supply context and oversight during information retrieval and response generation.
If this is right
- Hybrid human-AI systems deliver performance gains in information-seeking tasks that remain consistent across model scales from 3B to 70B.
- User ratings of usability and satisfaction do not increase meaningfully with larger models during collaborative use.
- Evaluations that include actual multi-turn human interactions capture benefits that isolated model benchmarks miss.
- RAG-assistants paired with humans can support tasks involving compliance and secure data handling without requiring the largest models.
Where Pith is reading between the lines
- Smaller models could be paired with humans to achieve most of the performance benefit at lower computational cost.
- The pattern may appear in other interactive AI tasks where human input compensates for model shortcomings.
- Direct tests with real sensitive data and legal constraints would be required to verify applicability beyond the lab setting.
Load-bearing premise
The multi-turn information-seeking scenario used in the experiment is representative of real workplace settings that require compliance with local legislation and secure handling of sensitive data.
What would settle it
A follow-up study using actual sensitive compliance queries in which human-AI teams show no performance advantage over model-only systems would falsify the claim of significant collaborative gains.
Original abstract
Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from a user study (N=112) evaluating RAG-based chatbot assistants in a multi-turn information-seeking task inspired by workplace compliance and sensitive-data scenarios. It compares human-AI collaboration performance and perceptions against LLM-only and LLM+RAG baselines across three model sizes (3B, 8B, 70B), claiming significant performance gains from human assistance that hold irrespective of model size, with little variation in perceived usability or satisfaction across sizes.
Significance. If the results hold, the work provides useful empirical evidence that hybrid human-AI systems can deliver performance benefits in multi-turn information-seeking tasks beyond what standalone LLMs achieve, even with smaller models. The inclusion of both objective performance metrics and subjective usability/satisfaction measures in a realistic collaborative setting, rather than benchmark-only evaluation, is a strength that could inform design of efficient hybrid assistants.
major comments (3)
- [§3] §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.
- [§4] §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.
- [§4.2] §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.
minor comments (2)
- [Abstract] Abstract: The total N=112 is reported but not broken down by experimental condition or model size, which would improve clarity for interpreting the 'irrespective of model size' result.
- [Figures] Figure captions and legends (throughout): Some figures comparing performance across model sizes would benefit from explicit indication of statistical significance markers and sample sizes per bar.
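The statistical reporting requested in the major comment on §4 (per-condition sample sizes, confidence intervals, effect sizes) could look roughly like the following Python sketch. The scores, the `summarize` and `one_way_anova` helpers, and the choice of a one-way ANOVA with eta-squared are assumptions made for illustration; the paper's actual data and tests are not given in this review.

```python
import math
import statistics

# Hypothetical per-participant task scores (0-1), grouped by model size.
# These numbers are invented for illustration; they are NOT the study's data.
scores = {
    "3B": [0.62, 0.70, 0.58, 0.66, 0.74, 0.61],
    "8B": [0.65, 0.72, 0.60, 0.69, 0.71, 0.63],
    "70B": [0.68, 0.73, 0.64, 0.70, 0.75, 0.66],
}


def summarize(values, z=1.96):
    """Return n, mean, and a normal-approximation 95% CI for the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return {"n": n, "mean": mean, "ci": (mean - z * sem, mean + z * sem)}


def one_way_anova(groups):
    """Return the F statistic and eta-squared for a one-way design."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    # Variation explained by group membership (here: model size).
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation left over inside each group.
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


for size, values in scores.items():
    s = summarize(values)
    low, high = s["ci"]
    print(f"{size}: n={s['n']} mean={s['mean']:.3f} CI=({low:.3f}, {high:.3f})")

f_stat, eta_sq = one_way_anova(list(scores.values()))
print(f"F={f_stat:.3f} eta^2={eta_sq:.3f}")
```

With groups as small as these, a t-distribution critical value would be more appropriate than z = 1.96; the normal approximation is used here only to keep the sketch dependency-free.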
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major comments below, and we will incorporate revisions as indicated.
- Referee: §3 (Methodology), Task Description: The scenario is described only as 'inspired by' workplace settings involving compliance with local legislation and secure handling of sensitive data, without specifying how these constraints (e.g., legal accuracy requirements, data-handling rules, or expert-validated ground truth) were operationalized or enforced in the task or evaluation. This is load-bearing for the central claim that hybrid gains generalize to real information-seeking scenarios.
  Authors: We agree that more detail on the operationalization of the task constraints is necessary. In the revised manuscript, we will expand the Methodology section to specify how the compliance and sensitive data handling aspects were implemented, including the provision of expert-validated ground truth for accuracy assessment and the specific rules participants and the AI assistant were instructed to adhere to during the interactions.
  Revision: yes
- Referee: §4 (Results): The abstract and results state that performance gains are 'significant' and 'irrespective of model size,' yet no details are provided on the statistical tests (e.g., ANOVA, t-tests), p-values, effect sizes, per-condition sample sizes, or error bars/confidence intervals. This prevents verification that the independence from the 3B/8B/70B conditions is robust.
  Authors: We apologize for the insufficient statistical reporting in the current version. We will update the Results section to include full details of the statistical tests conducted, including the use of ANOVA to assess effects across model sizes, t-tests for pairwise comparisons, reported p-values, effect sizes, per-condition sample sizes (with total N=112), and the addition of error bars or confidence intervals to figures.
  Revision: yes
- Referee: §4.2 (Performance Analysis): The comparison to 'model-only baselines' does not clarify whether the LLM-only condition includes the same RAG retrieval component as the human-assisted condition; if not, the attributed human-AI gain may be confounded by the presence of RAG itself rather than the collaboration dynamic.
  Authors: We will revise §4.2 to explicitly clarify the experimental conditions. The human-AI collaboration condition incorporates RAG, the LLM-only baseline does not include RAG, and there is a separate LLM+RAG baseline without human involvement. This design allows us to demonstrate gains from human collaboration over both pure LLM and LLM+RAG conditions, thereby isolating the contribution of the human-AI interaction dynamic.
  Revision: yes
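The three-condition design described in the response to the §4.2 comment (LLM-only, LLM+RAG, human with RAG-assistant) lets the total gain be split into a retrieval part and a collaboration part. The Python sketch below shows that arithmetic. The accuracy values, the dictionary keys, and the `decompose` helper are hypothetical and chosen only to illustrate the bookkeeping; they are not results from the paper.

```python
# Hypothetical accuracies per condition and model size (illustrative only,
# NOT the study's results).
accuracy = {
    "3B": {"llm_only": 0.40, "llm_rag": 0.55, "human_rag": 0.80},
    "8B": {"llm_only": 0.45, "llm_rag": 0.60, "human_rag": 0.82},
    "70B": {"llm_only": 0.55, "llm_rag": 0.68, "human_rag": 0.85},
}


def decompose(cond):
    """Split the total gain over LLM-only into RAG and human contributions."""
    rag_gain = cond["llm_rag"] - cond["llm_only"]      # effect of adding retrieval
    human_gain = cond["human_rag"] - cond["llm_rag"]   # effect of adding the human
    total_gain = cond["human_rag"] - cond["llm_only"]  # sum of the two parts
    return {"rag_gain": rag_gain, "human_gain": human_gain, "total_gain": total_gain}


for size, cond in accuracy.items():
    d = decompose(cond)
    print(
        f"{size}: RAG +{d['rag_gain']:.2f}, "
        f"human +{d['human_gain']:.2f}, total +{d['total_gain']:.2f}"
    )
```

Comparing against the LLM+RAG baseline rather than LLM-only is what addresses the referee's confound: the `human_gain` term holds retrieval fixed, so it cannot be explained by the presence of RAG alone.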
Circularity Check
No significant circularity in empirical user study
This paper reports results from a controlled user study (N=112) comparing human-AI collaboration against LLM-only and LLM+RAG baselines in a multi-turn information-seeking task. The headline finding—that human-AI performance gains are significant irrespective of model size (3B/8B/70B)—is obtained directly from measured accuracy, usability, and satisfaction data collected in the experiment. No mathematical derivations, equations, fitted parameters, or predictive models are present that could reduce any claimed result to its inputs by construction. Self-citations, if any, support background or related work but do not carry the load-bearing argument; the evidence is primary experimental data rather than a self-referential chain.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions underlying statistical significance tests for performance differences in user studies
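To make the kind of assumption listed above concrete, here is a minimal Python sketch of a two-sided permutation test for a difference in mean performance between two conditions. Its only assumption is that, under the null hypothesis, scores are exchangeable between conditions. The `permutation_test` helper, the sample values, and the choice of a permutation test are illustrative assumptions; this review does not state which tests the paper used.

```python
import random
import statistics


def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel participants at random
        diff = statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):])
        # Small tolerance guards against floating-point rounding in the means.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical scores (illustrative only, NOT the study's data).
human_rag = [0.80, 0.85, 0.78, 0.90, 0.82, 0.88]
llm_rag = [0.50, 0.55, 0.52, 0.60, 0.48, 0.58]

diff, p_value = permutation_test(human_rag, llm_rag)
print(f"observed difference={diff:.3f}, p={p_value:.4f}")
```

Because the test relabels participants instead of relying on a reference distribution, it does not need normality; it does assume independent observations, which would not hold if the same participant contributed scores to both conditions.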