
arxiv: 2604.25924 · v1 · submitted 2026-04-01 · 💻 cs.CL · cs.AI · cs.IR

Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

Pith reviewed 2026-05-13 22:43 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.IR
keywords: virtual assistant · retrieval-augmented generation · large language models · educational support · hallucinations · bachelor projects · evaluation study · university regulations

The pith

Retrieval-augmented generation lets a virtual assistant give reliable answers to bachelor students' questions on project regulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds and tests a virtual assistant that uses retrieval-augmented generation to help Maastricht University bachelor students find accurate information on project rules. Standard large language models often produce hallucinations or omit key details when handling narrow, changing regulations, so the system adds retrieval of current documents to ground its outputs. A structured evaluation plus live testing with students shows the assistant meets practical needs in this setting. This matters for anyone who wants LLM tools to work dependably in education without constant manual correction. The work focuses on one real deployment to measure whether the RAG fix actually delivers usable results.

Core claim

We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context.

What carries the argument

A Retrieval-Augmented Generation system that pulls relevant project-regulation documents into each prompt so that the language model produces context-specific answers.
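The retrieve-then-prompt step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regulation snippets, the stopword list, and the function names `retrieve` and `build_prompt` are all hypothetical, and a simple term-overlap cosine score stands in for whatever retriever the real system uses (a deployed system would more likely rely on an embedding model and a vector store).

```python
import math
import re
from collections import Counter

# Hypothetical regulation snippets, invented for illustration only.
DOCUMENTS = [
    "The project report must be submitted before the deadline set by the coordinator.",
    "Each project group has an assigned supervisor and an examiner.",
    "Students must attend the scheduled project meetings.",
]

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "by", "be", "who", "must", "has", "each", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```

The resulting prompt string would then be passed to the language model; the instruction to answer only from the context is one common way to discourage unsupported answers, though it does not guarantee them.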

If this is right

  • The assistant supplies timely and accurate information on project rules without requiring staff to answer every query.
  • RAG integration reduces the rate of incorrect or incomplete responses compared with a plain language model.
  • A repeatable evaluation framework can measure success for similar assistants in other specialized university contexts.
  • Real-life student testing provides evidence that the system handles the practical demands of an educational setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Periodic refresh of the retrieved document store would be required to keep answers current when rules change.
  • The same retrieval approach could support queries in other narrow administrative domains such as course registration or exam policies.
  • Combining the assistant with a feedback loop that logs and corrects errors could further improve reliability over time.
  • Deployment at additional universities would test whether the same RAG setup transfers without major redesign.
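As an illustration of the first point, a periodic refresh could compare content fingerprints to decide which documents need re-indexing. The sketch below is an assumption about how such a refresh might be planned; the function names and the dictionary layout are hypothetical and do not come from the paper.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(indexed, current):
    """Decide which documents to re-index or delete.

    indexed: dict mapping doc_id -> fingerprint stored at the last indexing run
    current: dict mapping doc_id -> current document text
    Returns (to_index, to_delete) as sorted lists of doc_ids.
    """
    # New documents and documents whose content changed need (re-)indexing.
    to_index = sorted(
        doc_id for doc_id, text in current.items()
        if indexed.get(doc_id) != fingerprint(text)
    )
    # Documents that disappeared from the source should leave the store.
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current)
    return to_index, to_delete
```

Running this on a schedule, or whenever the regulations are republished, keeps unchanged documents untouched and limits re-embedding work to what actually changed.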

Load-bearing premise

Adding current domain documents through retrieval is enough to cut hallucinations and missing facts so the assistant gives correct answers on project regulations.

What would settle it

Run the same student queries on the live system and check whether any answers still contain wrong or missing regulation details that the source documents actually cover.
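A check of this kind could be sketched as below, assuming each test query comes with a hand-written list of facts the answer must contain and, optionally, known-wrong statements it must not contain. The function name, the phrase-matching rule, and the example facts are hypothetical; exact phrase matching is crude and would miss paraphrases, so a human or model-based judge would likely be needed in practice.

```python
import re


def normalize(text):
    """Lowercase and collapse to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def check_answer(answer, required, forbidden=()):
    """Report required facts that are missing and forbidden facts that appear."""
    # Pad with spaces so "10 pages" does not match inside "110 pages".
    padded = f" {normalize(answer)} "
    missing = [f for f in required if f" {normalize(f)} " not in padded]
    wrong = [f for f in forbidden if f" {normalize(f)} " in padded]
    return {"missing": missing, "wrong": wrong, "ok": not missing and not wrong}
```

For example, with an invented rule that reports are limited to 20 pages, `check_answer("The report is limited to 30 pages.", required=["20 pages"], forbidden=["30 pages"])` flags both a missing and a wrong detail.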

Figures

Figures reproduced from arXiv: 2604.25924 by Aki Härmä, Chiara Magrone, Dumitru Verşebeniuc, Martijn Boussé, Martijn Elands, Mohammad Falah, Sara Falahatkar.

Figure 1.3: The self-reflection part is our fallback mechanism, which evaluates the generated response for hallucinations and relevance. If one part of self-reflection fails, it ...
Figure 1: The VA architecture consists of retrieval, generation, and self-reflection ...
Figure 2: Overall student satisfaction with the VA system showing a generally ...
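
The retrieval, generation, and self-reflection stages named in the captions for Figure 1 and Figure 1.3 could be wired together along the following lines. This is only a sketch: the captions are truncated and do not say what happens when a self-reflection check fails, so the retry loop and the fallback message here are assumptions, as are all function and parameter names.

```python
from typing import Callable, List

# Assumed fallback text; the paper's actual fallback behaviour is not shown here.
FALLBACK = "I could not find a reliable answer in the project documents."


def answer_with_reflection(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_grounded: Callable[[str, List[str]], bool],
    is_relevant: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> str:
    """Retrieve, generate, then accept the draft only if both checks pass."""
    for _ in range(max_attempts):
        passages = retrieve(question)
        draft = generate(question, passages)
        # Self-reflection: hallucination check against the passages,
        # then relevance check against the question.
        if is_grounded(draft, passages) and is_relevant(question, draft):
            return draft
    # Assumption: after repeated failures, decline rather than guess.
    return FALLBACK
```

Passing the four stages in as callables keeps the control flow testable with stubs, independent of any particular retriever or language model.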

Original abstract

Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project-specific regulations. We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM-based systems for specific applications and highlights areas for further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript describes the development of a generative AI-based virtual assistant using Retrieval-Augmented Generation (RAG) to assist bachelor students at Maastricht University with project-specific regulations. The authors claim that integrating domain-specific knowledge via RAG addresses challenges such as hallucinations and missing information in LLMs, and through a robust evaluation framework and real-life testing, demonstrate that the system effectively meets student needs.

Significance. If substantiated with quantitative evidence, the work could provide a practical case study on applying RAG to specialized educational domains. However, the lack of reported metrics, baselines, or detailed evaluation protocols significantly diminishes its potential impact and contribution to the field.

major comments (1)
  1. Abstract: The central claim that the virtual assistant 'can effectively meet the needs of students while addressing the inherent challenges' via 'a robust evaluation framework and real-life testing' is unsupported by any quantitative data. No accuracy, F1, hallucination rates, test-set size, baseline comparisons to non-RAG LLMs, or statistical tests are reported, leaving the assertion that RAG 'sufficiently reduces' hallucinations unverifiable.
minor comments (1)
  1. Abstract: The phrase 'robust evaluation framework' is used without even a high-level indication of the metrics or protocol, which weakens the summary of the contribution.
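
To make the referee's request concrete, the kind of figures asked for (F1, hallucination rate, and a paired comparison against a non-RAG baseline) could be computed roughly as below. This is a sketch under assumptions: the per-query correctness and hallucination flags would have to come from manual or automated judging that is not specified here, token-overlap F1 is one common choice among several, and the exact McNemar test is one reasonable option for paired binary outcomes, not a method the paper reports.

```python
import re
from collections import Counter
from math import comb


def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = re.findall(r"[a-z0-9]+", prediction.lower())
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def rate(flags):
    """Fraction of True flags, e.g. the share of answers judged hallucinated."""
    return sum(flags) / len(flags)


def mcnemar_exact(rag_correct, base_correct):
    """Two-sided exact McNemar p-value for paired per-query correctness flags."""
    # Only discordant pairs carry information about which system is better.
    b = sum(r and not s for r, s in zip(rag_correct, base_correct))  # only RAG correct
    c = sum(s and not r for r, s in zip(rag_correct, base_correct))  # only baseline correct
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Reporting these numbers alongside the test-set size would let a reader judge whether any observed difference between the RAG system and a plain model is larger than chance.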

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the constructive feedback on the evaluation aspects of our RAG-based virtual assistant for Maastricht University bachelor project regulations. We address the major comment point by point below and commit to revisions that strengthen the quantitative support for our claims.

Point-by-point responses
  1. Referee: Abstract: The central claim that the virtual assistant 'can effectively meet the needs of students while addressing the inherent challenges' via 'a robust evaluation framework and real-life testing' is unsupported by any quantitative data. No accuracy, F1, hallucination rates, test-set size, baseline comparisons to non-RAG LLMs, or statistical tests are reported, leaving the assertion that RAG 'sufficiently reduces' hallucinations unverifiable.

    Authors: We agree that the abstract's claims would be more compelling with explicit quantitative metrics. The full manuscript details a real-life testing process involving student queries on project regulations, with qualitative observations on response relevance and reduced hallucinations due to the RAG retrieval step. However, to address this concern directly, we will revise the abstract to report key figures from our evaluation (e.g., number of test queries, observed accuracy on factual correctness, and notes on hallucination instances before/after RAG). We will also expand the evaluation section with a table summarizing the test-set size, protocol, and any internal baseline comparisons to a non-RAG LLM setup.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical evaluation study

Full rationale

The paper is an empirical evaluation of a RAG-based virtual assistant with no equations, derivations, parameter fittings, or self-referential definitions. Claims rest on a described evaluation framework and real-life testing rather than any mathematical chain that reduces to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. This is a standard applied AI evaluation paper with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions of RAG systems and LLM behavior in specialized domains; no free parameters, new axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5494 in / 1022 out tokens · 30233 ms · 2026-05-13T22:43:25.702275+00:00 · methodology

