ARIA: Adaptive Retrieval Intelligence Assistant -- A Multimodal RAG Framework for Domain-Specific Engineering Education
Pith reviewed 2026-05-16 07:48 UTC · model grok-4.3
The pith
A multimodal RAG framework answers all relevant test questions from one engineering course while rejecting nearly all irrelevant ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a retrieval-augmented generation architecture, equipped with multimodal extraction for text, formulas, and diagrams plus controlled prompting, produces domain-specific educational assistants that achieve 100 percent recall on relevant questions, 90.9 percent precision in query filtering, and superior pedagogical quality compared with a general large language model on the same materials.
What carries the argument
The multimodal retrieval pipeline that converts course documents containing text, formulas, and diagrams into semantic embeddings for grounded response generation and query filtering.
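The filtering step described here can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: a toy bag-of-words embedder stands in for e5-large-v2 so the example is self-contained, and the 0.3 similarity threshold is illustrative, not a value reported in the paper.

```python
# Minimal sketch of embedding-similarity query filtering (hypothetical:
# the paper publishes no code). A toy bag-of-words embedder stands in
# for e5-large-v2; the threshold is illustrative.
VOCAB = ["beam", "stress", "strain", "moment", "shear", "movie", "recipe", "weather"]

def embed(text: str) -> list[float]:
    """Stand-in embedder: vocabulary-count vector, L2-normalized."""
    tokens = text.lower().split()
    v = [float(tokens.count(w)) for w in VOCAB]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else v

def is_in_domain(query: str, corpus_chunks: list[str], threshold: float = 0.3) -> bool:
    """Accept a query only if it is close enough to some course chunk."""
    q = embed(query)
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    sims = [sum(a * b for a, b in zip(q, embed(c))) for c in corpus_chunks]
    return max(sims, default=0.0) >= threshold

chunks = [
    "shear and moment diagrams for a simply supported beam",
    "normal stress and strain under axial load",
]
print(is_in_domain("how do I draw the shear and moment diagram", chunks))  # True
print(is_in_domain("recommend a movie for the weekend", chunks))           # False
```

In the actual system, out-of-domain queries caught this way would be rejected before any generation happens, which is what produces the precision and recall numbers the paper reports.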
If this is right
- Course materials can be loaded directly to create a custom assistant without retraining the base model.
- The built-in filtering step stops the assistant from generating answers for questions outside the course scope.
- Response quality for on-topic questions can reach near-expert levels under expert review.
- The same architecture can be reused across multiple courses with only the source documents changed.
Where Pith is reading between the lines
- Instructors could update the assistant each semester simply by adding new lecture files rather than rebuilding the system.
- This approach may reduce the frequency of incorrect technical answers that general models give on specialized topics.
- Linking the assistant to a course platform could give students immediate, material-grounded help during study.
- Testing on courses outside engineering would show whether the same extraction and filtering steps work in other fields.
Load-bearing premise
Results from one engineering course tested with a small fixed set of queries will hold for other subjects, varied student phrasing, and real classroom use.
What would settle it
Measure whether the same system maintains 100 percent recall and high precision when loaded with materials from a different course and tested on a fresh set of relevant and irrelevant questions.
Original abstract
Developing effective, domain-specific educational support systems is central to advancing AI in education. Although large language models (LLMs) demonstrate remarkable capabilities, they face significant limitations in specialized educational applications, including hallucinations, limited knowledge updates, and lack of domain expertise. Fine-tuning requires complete model retraining, creating substantial computational overhead, while general-purpose LLMs often provide inaccurate responses in specialized contexts due to reliance on generalized training data. To address this, we propose ARIA (Adaptive Retrieval Intelligence Assistant), a Retrieval-Augmented Generation (RAG) framework for creating intelligent teaching assistants across university-level courses. ARIA leverages a multimodal content extraction pipeline combining Docling for structured document analysis, Nougat for mathematical formula recognition, and GPT-4 Vision API for diagram interpretation, with the e5-large-v2 embedding model for high semantic performance and low latency. This enables accurate processing of complex educational materials while maintaining pedagogical consistency through engineered prompts and response controls. We evaluate ARIA using lecture material from Statics and Mechanics of Materials, a sophomore-level civil engineering course at Johns Hopkins University, benchmarking against ChatGPT-5. Results demonstrate 97.5% accuracy in domain-specific question filtering and superior pedagogical performance. ARIA correctly answered all 20 relevant course questions while rejecting 58 of 60 non-relevant queries, achieving 90.9% precision, 100% recall, and 4.89/5.0 average response quality. These findings demonstrate that ARIA's course-agnostic architecture represents a scalable framework for domain-specific educational AI deployment.
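As a sanity check, the headline metrics follow directly from the confusion counts the abstract reports (a sketch of the arithmetic, not the authors' evaluation code):

```python
# Reconstructing the reported metrics from the counts stated in the abstract:
# 20 relevant queries answered (TP), 0 missed (FN),
# 58 of 60 non-relevant queries rejected (TN), 2 wrongly answered (FP).
tp, fn, tn, fp = 20, 0, 58, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 78/80
precision = tp / (tp + fp)                   # 20/22
recall = tp / (tp + fn)                      # 20/20

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.975 precision=0.909 recall=1.000
```

The 97.5% accuracy, 90.9% precision, and 100% recall figures are therefore internally consistent with the 20/60 query split.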
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARIA, a multimodal RAG framework for domain-specific engineering education that integrates Docling for structured document parsing, Nougat for formula recognition, GPT-4 Vision for diagram interpretation, and the e5-large-v2 embedding model, together with engineered prompts and response controls. Evaluated on lecture materials from a single Statics and Mechanics of Materials course at Johns Hopkins University, ARIA is benchmarked against ChatGPT-5 and claims 97.5% filtering accuracy, 100% recall on 20 relevant queries, 90.9% precision from rejecting 58 of 60 non-relevant queries, and 4.89/5.0 average response quality, positioning the system as a scalable, course-agnostic alternative to fine-tuning LLMs.
Significance. If the performance claims hold under broader testing, ARIA demonstrates a practical, low-overhead path to domain-specific educational assistants that mitigates hallucinations and knowledge staleness without full model retraining. The explicit multimodal pipeline for equations and diagrams is a concrete strength for STEM contexts, and the emphasis on engineered prompts for pedagogical consistency offers a replicable template for other courses.
major comments (1)
- Evaluation section: the reported 100% recall and 90.9% precision rest on a hand-curated set of only 80 queries (20 relevant + 60 non-relevant) drawn exclusively from one course's materials at a single institution. No description is given of query authorship, sampling from real student logs, inter-rater reliability, or hold-out testing on other domains or material types, so the generalization of the superiority claim over ChatGPT-5 cannot be assessed from the presented evidence.
minor comments (2)
- Methods: the exact system prompts, few-shot examples, and response-control instructions used for both ARIA and the ChatGPT-5 baseline are not provided, preventing reproduction of the comparison.
- Results: no error analysis, confusion-matrix breakdown, or statistical significance tests accompany the accuracy and quality metrics.
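The call for statistical reporting can be illustrated with a quick uncertainty estimate that does not appear in the paper: a 95% Wilson score interval around the 20-of-22 filtering precision.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

# Precision was 20 correctly answered out of 22 answered queries (20 TP + 2 FP).
lo, hi = wilson_interval(20, 22)
print(f"precision 0.909, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With only 22 answered queries the interval is wide, roughly 0.72 to 0.97, so the 90.9% point estimate alone overstates how precisely the filtering performance is known.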
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the evaluation section below and will revise the paper to improve transparency and contextualize our claims appropriately.
Point-by-point responses
- Referee: Evaluation section: the reported 100% recall and 90.9% precision rest on a hand-curated set of only 80 queries (20 relevant + 60 non-relevant) drawn exclusively from one course's materials at a single institution. No description is given of query authorship, sampling from real student logs, inter-rater reliability, or hold-out testing on other domains or material types, so the generalization of the superiority claim over ChatGPT-5 cannot be assessed from the presented evidence.
Authors: We acknowledge the controlled nature of the evaluation and will revise the Evaluation section to provide a detailed description of the query curation process. The 80 queries were authored by the research team (with domain expertise in civil engineering) to cover core topics from the Statics and Mechanics of Materials lecture materials for the relevant set, while the non-relevant queries were drawn from unrelated engineering and general topics to test filtering. We did not sample from real student interaction logs to preserve privacy and maintain experimental control. Inter-rater reliability metrics were not computed as query design was performed internally by the authors. We agree that the single-course scope limits strong generalization claims; we will add explicit caveats and a limitations paragraph stating that these results represent a proof-of-concept demonstration on one domain-specific course. The architecture itself is designed to be course-agnostic via the same multimodal pipeline and prompts, but we will temper language around superiority over ChatGPT-5 to reflect the current evidence base.
Revision: partial
- The study does not include hold-out testing or evaluation on other courses, institutions, or material types, so we cannot provide such results or inter-rater reliability data from external annotators.
Circularity Check
No circularity: empirical evaluation on external course materials
Full rationale
The paper describes a multimodal RAG architecture and reports direct empirical measurements (97.5% filtering accuracy, 100% recall on 20 queries, 4.89/5 quality) against an external benchmark (ChatGPT-5) and fixed course lecture materials. No equations, fitted parameters, or self-citations are used to derive the performance numbers; the results are presented as measured outcomes on a curated query set rather than being forced by construction from the framework definition itself. The evaluation is therefore self-contained against external references and does not reduce to any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- Selection of e5-large-v2 embedding model
axioms (1)
- Domain assumption: Multimodal tools (Docling, Nougat, GPT-4 Vision) accurately extract structured content, including formulas and diagrams, from course materials.
Reference graph
Works this paper leans on
- [1]
- [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023)
- [3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023)
- [4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (12) (2023) 1–38
- [5]
- [6] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, arXiv preprint arXiv:2312.10997 2 (1) (2023)
- [7] Y. Goldberg, A primer on neural network models for natural language processing, Journal of Artificial Intelligence Research 57 (2016) 345–420
- [8] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, S. Riedel, Language models as knowledge bases?, arXiv preprint arXiv:1909.01066 (2019)
- [9] A. Roberts, C. Raffel, N. Shazeer, How much knowledge can you pack into the parameters of a language model?, arXiv preprint arXiv:2002.08910 (2020)
- [10]
- [11] Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y. Huang, C. Xiao, et al., Tool learning with foundation models, ACM Computing Surveys 57 (4) (2024) 1–40
- [12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186
- [13] J. D. M.-W. C. Kenton, L. K. Toutanova, et al., BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, Vol. 1, Minneapolis, Minnesota, 2019
- [14]
- [15]
- [16] K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, Retrieval augmentation reduces hallucination in conversation, arXiv preprint arXiv:2104.07567 (2021)
- [17] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: International Conference on Machine Learning, PMLR, 2020, pp. 3929–3938
- [18] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., Improving language models by retrieving from trillions of tokens, in: International Conference on Machine Learning, PMLR, 2022, pp. 2206–2240
- [19]
- [20]
- [21] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al., A survey on large language model based autonomous agents, Frontiers of Computer Science 18 (6) (2024) 186345
- [22] L. Blecher, G. Cucurull, T. Scialom, R. Stojnic, Nougat: Neural optical understanding for academic documents, arXiv preprint arXiv:2308.13418 (2023)
- [23] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, arXiv preprint arXiv:1908.10084 (2019)
- [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017)
- [25] J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng, J. F. C. Uribe, L. Fedus, L. Metz, M. Pokorny, et al., ChatGPT: Optimizing language models for dialogue, OpenAI Blog 2 (4) (2022)
- [26] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al., DeepSeek LLM: Scaling open-source language models with longtermism, arXiv preprint arXiv:2401.02954 (2024)
- [27] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., LoRA: Low-rank adaptation of large language models, ICLR 1 (2) (2022) 3
- [28] C. Li, H. Farkhoor, R. Liu, J. Yosinski, Measuring the intrinsic dimension of objective landscapes, arXiv preprint arXiv:1804.08838 (2018)
- [29] A. Aghajanyan, L. Zettlemoyer, S. Gupta, Intrinsic dimensionality explains the effectiveness of language model fine-tuning, arXiv preprint arXiv:2012.13255 (2020)
- [30] Q. Xu, J. Gu, J. Lu, Leveraging artificial intelligence and large language models for enhanced teaching and learning: A systematic literature review, in: 2024 13th International Conference on Computer Technologies and Development (TechDev), IEEE, 2024, pp. 73–77
- [31]
- [32] S. Yang, H. Zhao, Y. Xu, K. Brennan, B. Schneider, Debugging with an AI tutor: Investigating novice help-seeking behaviors and perceived learning, in: Proceedings of the 2024 ACM Conference on International Computing Education Research, Volume 1, 2024, pp. 84–94
- [33] Z. Zhang, D. Zhang-Li, J. Yu, L. Gong, J. Zhou, Z. Hao, J. Jiang, J. Cao, H. Liu, Z. Liu, et al., Simulating classroom education with LLM-empowered agents, arXiv preprint arXiv:2406.19226 (2024)
- [34]
- [35]
- [36] C. K. Y. Chan, A comprehensive AI policy education framework for university teaching and learning, International Journal of Educational Technology in Higher Education 20 (1) (2023) 38
- [37]
- [38] W. Xing, C. Li, H. Li, W. Zhu, B. Lyu, Z. Yan, Is retrieval-augmented generation all you need? Investigating structured external memory to enhance large language models' generation for math learning (2024)
- [39] F. P. Beer, E. R. Johnston Jr, J. T. DeWolf, D. F. Mazurek, Mechanics of materials (2006)
- [40] N. Livathinos, C. Auer, M. Lysak, A. Nassar, M. Dolfi, P. Vagenas, C. B. Ramis, M. Omenetti, K. Dinkla, Y. Kim, et al., Docling: An efficient open-source toolkit for AI-driven document conversion, arXiv preprint arXiv:2501.17887 (2025)