RAGe: A Retrieval-Augmented Generation Evaluation Framework

Arthur Accorsi; Dalvan Griebler; Felipe Meneguzzi; Gustavo Losch do Amaral; João Pedro de Moura; Larissa Guder; Marcio Sorraglia Pinho; Maurício Cecílio Magnaguagno

arxiv: 2605.27445 · v1 · pith:VRGE7HD2new · submitted 2026-05-23 · 💻 cs.IR · cs.AI

Pith reviewed 2026-06-30 12:13 UTC · model grok-4.3

keywords Retrieval-Augmented Generation · RAG evaluation · resource telemetry · component recommendation · LLM applications · benchmarking framework · hardware constraints · domain-specific setups

The pith

RAGe benchmarks RAG pipelines by correlating output quality with hardware resource use to recommend components for a given dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAGe as a modular framework that benchmarks Retrieval-Augmented Generation applications. It evaluates choices for document chunking, embedding models, retrievers, and vector databases by measuring both quality metrics and the hardware resources each choice consumes. The framework then uses those correlations to suggest the best combination for a particular domain and hardware setting. A sympathetic reader would care because current RAG deployments often require manual trial-and-error on expensive machines, and a telemetry-driven recommender could reduce that cost.

Core claim

RAGe evaluates trade-offs among accuracy, efficiency, and scalability in RAG pipelines by directly correlating retrieval and generation quality with the hardware constraints imposed by its core techniques (document chunking, vector databases, embedding models, and retrievers). It uses those correlations to suggest the best components for a domain-specific dataset.

What carries the argument

The RAGe framework collects resource telemetry during RAG component execution and correlates it with quality scores to produce component recommendations.
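To make "resource telemetry during component execution" concrete, here is a hedged sketch that wraps a single component call and records wall-clock time and peak Python heap usage with the standard library. The paper does not say which signals RAGe records or how. The `Telemetry` type, `run_with_telemetry`, and the toy `fixed_size_chunks` chunker are assumptions made for illustration. `tracemalloc` sees only Python-level allocations, not GPU or native-library memory.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Telemetry:
    wall_seconds: float       # elapsed wall-clock time of the call
    peak_python_bytes: int    # peak Python heap usage seen by tracemalloc


def run_with_telemetry(component: Callable[..., Any],
                       *args: Any, **kwargs: Any) -> tuple[Any, Telemetry]:
    """Run one pipeline component and return (result, telemetry)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = component(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, Telemetry(elapsed, peak)


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Toy chunker: split text into consecutive pieces of at most `size` chars."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# 12,000 characters split into 256-character chunks.
chunks, stats = run_with_telemetry(fixed_size_chunks, "lorem ipsum " * 1000, 256)
```

The same wrapper could be applied to embedding, indexing, and retrieval calls so that each stage produces a comparable telemetry record.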

If this is right

Researchers obtain explicit guidance on which RAG components suit their operational constraints without exhaustive manual testing.
Rapid prototyping of domain-specific RAG applications becomes feasible on consumer-grade hardware.
Trade-offs between accuracy and efficiency are quantified through direct telemetry correlations rather than isolated benchmarks.
Scalability assessments incorporate hardware limits from the start of component selection.
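One common way to quantify an accuracy-versus-efficiency trade-off is a Pareto front: keep only the configurations that no other configuration beats on both quality and cost. The paper does not state that RAGe uses this technique; the sketch below is an assumed illustration, and the configuration names and numbers are made-up placeholders, not results.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    name: str
    quality: float  # higher is better (e.g. an answer-accuracy score)
    cost: float     # lower is better (e.g. peak memory in GB)


def dominates(a: Measurement, b: Measurement) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(measurements: list[Measurement]) -> list[Measurement]:
    """Return the measurements that no other measurement dominates."""
    return [m for m in measurements
            if not any(dominates(other, m) for other in measurements)]


# Placeholder values for illustration only.
runs = [
    Measurement("A", quality=0.80, cost=4.0),
    Measurement("B", quality=0.75, cost=2.0),
    Measurement("C", quality=0.70, cost=3.0),  # dominated by B
    Measurement("D", quality=0.85, cost=8.0),
]
front = pareto_front(runs)
```

Here `C` drops out because `B` is both more accurate and cheaper, while `A`, `B`, and `D` each represent a different trade-off.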

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same telemetry-correlation method could be tested on non-RAG LLM tasks that also involve retrieval or generation steps.
If the correlations prove stable across many datasets, the framework might serve as the basis for automated optimization loops that iterate on recommendations.
Porting the system to additional hardware classes would reveal whether the learned mappings are hardware-specific or more general.

Load-bearing premise

Resource telemetry collected from standard RAG components can be directly correlated with retrieval and generation quality in a way that reliably identifies the best components for arbitrary new datasets and hardware.

What would settle it

Apply RAGe to a held-out domain dataset, deploy the recommended components on the target hardware, and measure whether they produce lower accuracy or higher resource consumption than a standard baseline selection.
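The test described above reduces to a simple comparison once both configurations have been run on the held-out dataset. The following sketch encodes that check; `RunResult`, `recommendation_fails`, the choice of memory and latency as the resource measures, and the example values are all assumptions, since the paper defines no evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    accuracy: float        # task accuracy on the held-out dataset
    peak_memory_mb: float  # assumed resource measure
    latency_s: float       # assumed resource measure


def recommendation_fails(recommended: RunResult,
                         baseline: RunResult,
                         accuracy_tolerance: float = 0.0) -> bool:
    """True if the recommended setup is less accurate or more resource-hungry.

    `accuracy_tolerance` allows for run-to-run noise in the accuracy metric.
    """
    worse_accuracy = recommended.accuracy < baseline.accuracy - accuracy_tolerance
    worse_resources = (recommended.peak_memory_mb > baseline.peak_memory_mb
                       or recommended.latency_s > baseline.latency_s)
    return worse_accuracy or worse_resources


# Placeholder numbers, not measurements from the paper.
baseline = RunResult(accuracy=0.72, peak_memory_mb=4096, latency_s=1.8)
recommended = RunResult(accuracy=0.75, peak_memory_mb=3500, latency_s=1.5)
failed = recommendation_fails(recommended, baseline)
```

A stricter protocol would repeat each run several times and compare distributions rather than single values.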

Original abstract

Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe supports researchers to identify the most effective, domain-specific RAG setups for their specific operational needs, facilitating rapid prototyping even on consumer-grade hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAGe describes a modular RAG benchmarking framework that adds hardware telemetry for component recommendations, but supplies no experiments, metrics, or results to show the correlation works.

The paper's main point is a framework called RAGe, meant to benchmark RAG pipelines while tracking resource use and then to recommend components for a given dataset and hardware. It covers the usual stages (chunking, embeddings, vector stores, retrievers) and adds telemetry to link them to quality and efficiency.

The practical angle is the only real addition. Most RAG work assumes plenty of compute, so explicitly factoring in hardware limits for consumer-grade setups is a reasonable direction for applied work. The modular structure also makes it easy to swap pieces and measure trade-offs.

Beyond that description, the paper contains no implementation details, no defined metrics for the telemetry-quality link, and no test cases. The claim that the framework can identify effective domain-specific setups rests entirely on an unshown correlation. The stress-test note is accurate here: without data or an algorithm showing the mapping is reliable rather than spurious, the recommendation engine stays unverified.

This is aimed at engineers and applied researchers who prototype RAG systems under tight resource constraints and want a structured comparison tool. A reader looking for new methods, validated benchmarks, or reproducible findings will not get much from it.

I would not send this for peer review in its current form. It needs at least one concrete experiment with results before referees could judge whether the telemetry approach delivers on its promises.

Referee Report

2 major / 0 minor

Summary. The paper proposes RAGe, a modular framework for benchmarking RAG pipelines. It focuses on resource telemetry from components including document chunking, vector databases, embedding models, and retrievers, with the goal of correlating these metrics with retrieval/generation quality to recommend optimal, domain-specific setups that balance accuracy, efficiency, and scalability, including on consumer hardware.

Significance. If the claimed correlation and recommendation mechanism were validated, the framework would address a practical bottleneck in RAG deployment by enabling data-driven component selection without exhaustive manual tuning.

major comments (2)

Abstract: the central claim that the framework 'directly correlat[es] retrieval and generation quality with underlying hardware constraints' to 'suggest the best components' is asserted without any accompanying methodology, metrics, algorithm, or experimental results demonstrating that such a correlation exists, is causal, or generalizes across domains.
Abstract and full text: no implementation details, pseudocode, evaluation protocol, datasets, or telemetry-quality mapping are supplied, leaving the recommendation engine as an unverified assertion rather than a demonstrated capability.
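As an illustration of the kind of evidence these comments ask for, a first step would be to report a rank correlation between a telemetry signal and a quality score across evaluated configurations. The sketch below computes Spearman's rank correlation using only the standard library. It is an assumed example, not something the paper provides, and a correlation on its own would show neither causation nor generalization across domains.

```python
from statistics import mean


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))
```

For example, `x` could hold peak memory per configuration and `y` the matching quality scores. Reporting the coefficient per dataset would show whether the relationship is stable or dataset-specific.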

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable feedback. We address each of the major comments below and outline the revisions we plan to make.

Referee: Abstract: the central claim that the framework 'directly correlat[es] retrieval and generation quality with underlying hardware constraints' to 'suggest the best components' is asserted without any accompanying methodology, metrics, algorithm, or experimental results demonstrating that such a correlation exists, is causal, or generalizes across domains.

Authors: We agree with the referee that the abstract makes a central claim without accompanying details in the provided text. We will revise the abstract to tone down the claim or reference the new experimental validation we will add, and include in the full text the methodology, metrics, algorithm, and results showing the correlation and its generalizability across domains.
revision: yes

Referee: Abstract and full text: no implementation details, pseudocode, evaluation protocol, datasets, or telemetry-quality mapping are supplied, leaving the recommendation engine as an unverified assertion rather than a demonstrated capability.

Authors: We concur that implementation details are missing from the current version. The revised manuscript will supply implementation details, pseudocode, the evaluation protocol, datasets, and the telemetry-quality mapping to substantiate the recommendation engine.
revision: yes

Circularity Check

0 steps flagged

No circularity: framework proposal contains no derivations, fits, or self-referential predictions

The paper proposes a modular benchmarking framework for RAG pipelines that collects resource telemetry and suggests components for domain-specific datasets. No equations, fitted parameters, predictions derived from subsets of data, or load-bearing self-citations appear in the provided text. The central claim of correlation between telemetry and quality is asserted as a capability of the framework rather than derived from prior results or inputs by construction. Because the work is self-contained as a descriptive proposal without any reduction of outputs to inputs via definition or citation chains, the derivation chain (such as it is) exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, which contains no mathematical derivations, fitted parameters, background axioms, or postulated entities.


[1] [1]

Wimusim: simulat- ing realistic variabilities in wearable imus for human activity recognition.Frontiers in Computer Science, V olume 7 - 2025, 2025

Saleh, Y., Abu Talib, M., Nasir, Q., Dakalbab, F.: Evaluating large language mod- els: a systematic review of efficiency, applications, and future directions. Frontiers in Computer ScienceV olume 7 - 2025(2025) https://doi.org/10.3389/fcomp. 2025.1523699

work page doi:10.3389/fcomp 2025

[2] [2]

arXiv preprint arXiv:2505.19240 (2025)

Kostikova, A., Wang, Z., Bajri, D., P¨ utz, O., Paaßen, B., Eger, S.: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models. arXiv preprint arXiv:2505.19240 (2025)

work page arXiv 2025

[3] [3]

https://arxiv.org/ abs/2407.21059

Gao, Y., Xiong, Y., Wang, M., Wang, H.: Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks (2024). https://arxiv.org/ abs/2407.21059

work page arXiv 2024

[4] [4]

In: Companion Proceedings of the ACM on Web Con- ference 2025

Jin, J., Zhu, Y., Dou, Z., Dong, G., Yang, X., Zhang, C., Zhao, T., Yang, Z., Wen, J.-R.: FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research. In: Companion Proceedings of the ACM on Web Con- ference 2025. WWW ’25, pp. 737–740. Association for Computing Machin- ery, New York, NY, USA (2025). https://doi.org/10.1145/3701716.37...

work page doi:10.1145/3701716.3715313 2025

[5] [5]

In: Proceedings of the 31st International Conference on Neural Information Processing Systems

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY, USA (2017)

2017

[6] [6]

In: Burstein, J., Doran, C., Solorio, T

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

2019

[7] [7]

Language Models are Few-Shot Learners

Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Win- ter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[8] [8]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., 30 Rozi` ere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., Lample, G.: LLaMA: Open and Efficient Foundation Language Models (2023). https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

arXiv preprint arXiv:2501.04040 (2025)

Matarazzo, A., Torlone, R.: A survey on large language models with some insights on their capabilities and limitations. arXiv preprint arXiv:2501.04040 (2025)

work page arXiv 2025

[10] [10]

arXiv preprint arXiv:2509.23936 (2025)

Yuan, Z., Ding, Z., Vlachos, A.: Assessing large language models in updating their forecasts with new information. arXiv preprint arXiv:2509.23936 (2025)

work page arXiv 2025

[11] [11]

arXiv preprint arXiv:2505.07968 (2025)

Wu, W., Xu, X., Gao, C., Diao, X., Li, S., Salas, L.A., Gui, J.: Assessing and mitigating medical knowledge drift and conflicts in large language models. arXiv preprint arXiv:2505.07968 (2025)

work page arXiv 2025

[12] [12]

arXiv preprint arXiv:2403.08763 (2024)

Ibrahim, A., Th´ erien, B., Gupta, K., Richter, M.L., Anthony, Q., Lesort, T., Belilovsky, E., Rish, I.: Simple and Scalable Strategies to Continually Pre-train Large Language Models. arXiv preprint arXiv:2403.08763 (2024)

work page arXiv 2024

[13] [13]

In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., Li, Q.: A survey on rag meeting llms: Towards retrieval-augmented large language models. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’24, pp. 6491–6501. Association for Computing Machin- ery, New York, NY, USA (2024). https://doi.org/10.1...

work page doi:10.1145/3637528.3671470 2024

[14] [14]

In: Proceedings of the 34th Interna- tional Conference on Neural Information Processing Systems

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K¨ uttler, H., Lewis, M., Yih, W.-t., Rockt¨ aschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Proceedings of the 34th Interna- tional Conference on Neural Information Processing Systems. NIPS ’20. Curran Associates Inc., Red Hoo...

2020

[15] [15]

Lost in the Middle: How Language Models Use Long Contexts

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A.M., Uszkoreit, J., Le, Q., Petrov, S.: Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics7...

work page internal anchor Pith review doi:10.1162/tacl 2019

[16] [16]

In: Blunsom, P., Bordes, A., Cho, K., Cohen, S., Dyer, C., Grefenstette, E., Hermann, K.M., Rimell, L., Weston, J., Yih, S

Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., Sule- man, K.: NewsQA: A machine comprehension dataset. In: Blunsom, P., Bordes, A., Cho, K., Cohen, S., Dyer, C., Grefenstette, E., Hermann, K.M., Rimell, L., Weston, J., Yih, S. (eds.) Proceedings of the 2nd Workshop on Rep- resentation Learning for NLP, pp. 191–200. Association fo...

work page doi:10.18653/v1/w17-2623 2017

[17] [17]

Gemma Team

Joshi, M., Choi, E., Weld, D., Zettlemoyer, L.: TriviaQA: A large scale dis- tantly supervised challenge dataset for reading comprehension. In: Barzilay, R., Kan, M.-Y. (eds.) Proceedings of the 55th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611. Association for Computational Linguistics, Vancouve...

work page doi:10.18653/v1/p17-1147 2017

[18] [18]

arXiv preprint arXiv:2402.01763 (2025)

Jing, Z., Su, Y., Han, Y.: When Large Language Models Meet Vector Databases: A Survey. arXiv preprint arXiv:2402.01763 (2025)

work page arXiv 2025

[19] [19]

arXiv preprint arXiv:2312.09890 (2023)

Nastase, V., Merlo, P.: Grammatical information in bert sentence embeddings as two-dimensional arrays. arXiv preprint arXiv:2312.09890 (2023)

work page arXiv 2023

[20] [20]

(eds.) Findings of the Association for Computational Linguistics: NAACL 2025, pp

Qu, R., Tu, R., Bao, F.S.: Is semantic chunking worth the computational cost? In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) Findings of the Association for Computational Linguistics: NAACL 2025, pp. 2155–2177. Association for Compu- tational Linguistics, Albuquerque, New Mexico (2025). https://doi.org/10.18653/ v1/2025.findings-naacl.114 .https://aclantho...

2025

[21] [21]

ACM Transactions on Information Systems42(1), 1–35 (2023)

Bruch, S., Gai, S., Ingber, A.: An analysis of fusion functions for hybrid retrieval. ACM Transactions on Information Systems42(1), 1–35 (2023)

2023

[22] [22]

Journal of Documentation28(1), 11–21 (1972) https://doi.org/10

Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation28(1), 11–21 (1972) https://doi.org/10. 1108/eb026526

1972

[23] Gao, L., Dai, Z., Callan, J.: COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186 (2021)

[24] D., A.: Semantic Similarity - A Review of Approaches and Metrics. International Journal of Applied Engineering Research 9 (2018)

[25] Nogueira, R., Cho, K.: Passage Re-ranking with BERT (2020). https://arxiv.org/abs/1901.04085

[26] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43(2) (2025). https://doi.org/10.1145/3703155

[27] Es, S., James, J., Espinosa Anke, L., Schockaert, S.: RAGAs: Automated Evaluation of Retrieval Augmented Generation. In: Aletras, N., De Clercq, O. (eds.) Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp. 150–158. Association for Computational Linguistics, St. Julians, M... (2024)

[28] Pan, J.J., Wang, J., Li, G.: Survey of vector database management systems. The VLDB Journal 33(5), 1591–1615 (2024)

[29] Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L., Jégou, H.: The Faiss library. arXiv preprint arXiv:2401.08281 (2024)

[30] Mao, Y., He, J., Chen, C.: From prompts to templates: A systematic prompt template analysis for real-world LLMapps. In: Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pp. 75–86 (2025)

[31] Zhang, X., Song, Y.-Z., Wang, Y., Tang, S., Li, X., Zeng, Z., Wu, Z., Ye, W., Xu, W., Zhang, Y., Dai, X., Zhang, S., Wen, Q.: RAGLAB: A modular and research-oriented unified framework for retrieval-augmented generation. In: Hernandez Farias, D.I., Hope, T., Li, M. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Proce... (2024). https://doi.org/10.18653/v1/2024.emnlp-demo.43

[32] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-Rank Adaptation of Large Language Models. ICLR 1(2), 3 (2022)

[33] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient Finetuning of Quantized LLMs. URL https://arxiv.org/abs/2305.14314 2 (2023)

[34] Pradeep, R., Thakur, N., Sharifymoghaddam, S., Zhang, E., Nguyen, R., Campos, D., Craswell, N., Lin, J.: Ragnarök: A reusable RAG framework and baselines for TREC 2024 retrieval-augmented generation track. In: European Conference on Information Retrieval, pp. 132–148 (2025). Springer

[35] Mao, Q., Luo, Y., Zhang, Q., Luo, Y., Cao, Z., Zhang, J., Hao, H., Chen, Z., Jiang, W., Liu, J., Wang, X., Huang, Z., Tan, Z., Jie, S., Li, B., Liu, X., Zhang, R., Li, J.: XRAG: eXamining the Core – Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation (2025). https://arxiv.org/abs/2412.15529

[36] Barker, M., Bell, A., Thomas, E., Carr, J., Andrews, T., Bhatt, U.: Faster, cheaper, better: Multi-objective hyperparameter optimization for LLM and RAG systems. arXiv preprint arXiv:2502.18635 (2025)

[37] Bhowmick, A., Singhal, R.: Optirag: Optimized rag configuration under performance constraints. In: 2025 5th International Conference on AI-ML-Systems (AIMLSystems), pp. 318–323 (2025). IEEE

[38] Enevoldsen, K., Chung, I., Kerboua, I., Kardos, M., Mathur, A., Stap, D., Gala, J., Siblini, W., Krzemiński, D., Winata, G.I., Sturua, S., Utpala, S., Ciancone, M., Schaeffer, M., Sequeira, G., Misra, D., Dhakal, S., Rystrøm, J., Solomatin, R., Çağatan, Kundu, A., Bernstorff, M., Xiao, S., Sukhlecha, A., Pahwa, B., Poświata, R., GV, K.K., Ashraf,... arXiv preprint arXiv:2502.13595 (2025). https://doi.org/10.48550/arXiv.2502.13595

[39] Krishnan, A., Pasumarti, V., Inamdar, S., Mondal, A., Nambiar, M., Singhal, R.: Car-llm: Cloud accelerator recommender for large language models. In: 2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 89–99 (2024). IEEE