Recognition: unknown
A Hybrid Retrieval and Reranking Framework for Evidence-Grounded Retrieval-Augmented Generation
Pith reviewed 2026-05-09 17:12 UTC · model grok-4.3
The pith
Hybrid retrieval and reranking in RAG achieved 100% grounding accuracy in a 25-query pilot on biomedical document questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a hybrid retrieval and reranking framework that ingests PDF documents into a knowledge base, retrieves and reranks evidence chunks using hybrid search and Cohere reranker, generates answers from top evidence, and uses a judge model to verify each factual claim. In a pilot evaluation with 25 biomedical queries, this process yielded 500 retrieved chunks and 200 claims, all judged supported by the evidence, for 100% grounding accuracy. The framework supports citation-aware, evidence-grounded RAG when sufficient source evidence is available.
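The paper ships no code; as a minimal runnable sketch of that four-stage flow, the Python below uses keyword overlap as a stand-in for every learned component (the chunk texts, scoring rule, and function names are illustrative, not the authors' implementation):

```python
"""Toy sketch of the retrieve -> rerank -> generate -> judge flow the paper
describes. All scoring here is keyword overlap; the real system uses Titan
embeddings, OpenSearch hybrid search, a Cohere reranker, and an LLM judge."""
from typing import List

CHUNKS = [
    "BioBERT is a biomedical language model pretrained on PubMed abstracts.",
    "Transformers use self-attention to relate tokens in a sequence.",
    "MIMIC-IV is a freely accessible electronic health record dataset.",
]

def overlap(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve_hybrid(query: str, top_k: int) -> List[str]:
    # Stand-in for lexical + vector hybrid search over the knowledge base.
    return sorted(CHUNKS, key=lambda c: overlap(query, c), reverse=True)[:top_k]

def rerank(query: str, candidates: List[str], top_k: int) -> List[str]:
    # Stand-in for the Cohere reranker: re-score, reorder, truncate.
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_k]

def generate_answer(query: str, evidence: List[str]) -> str:
    # Stand-in for conservative, evidence-only answer generation.
    return evidence[0] if evidence else "No evidence available."

def judge_claim(claim: str, evidence: List[str]) -> bool:
    # Stand-in for the separate judge model's supported/unsupported verdict.
    return any(overlap(claim, e) >= 0.5 for e in evidence)

query = "What is BioBERT pretrained on?"
evidence = rerank(query, retrieve_hybrid(query, top_k=3), top_k=2)
answer = generate_answer(query, evidence)
claims = [answer]  # stand-in for LLM-based claim extraction
supported = sum(judge_claim(c, evidence) for c in claims)
print(f"{answer}  [grounding: {supported}/{len(claims)}]")
```

The shape worth noticing: the judge scores each claim against the same evidence list the generator consumed, which is exactly the shared-context coupling the referee flags below.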
What carries the argument
The hybrid retrieval-reranking pipeline combined with claim-level verification by a judge model.
Load-bearing premise
The separate judge model can accurately and without bias determine whether each generated claim is supported by the evidence, and the selected 25 queries and extracted claims are representative of real-world use cases.
What would settle it
Applying the judge model to claims that are known to be unsupported by the evidence and observing whether it still approves them, or running the full pipeline on a much larger and more varied set of queries and finding accuracy below 100%.
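The first test amounts to a negative control on the judge. A minimal sketch, assuming only a hypothetical judge(claim, evidence) -> bool callable wrapping whichever judge model is used; the probe claims and interface are illustrative, not from the paper:

```python
from typing import Callable, List, Sequence

def false_positive_rate(
    judge: Callable[[str, Sequence[str]], bool],
    evidence: Sequence[str],
    unsupported_claims: List[str],
) -> float:
    """Fraction of known-unsupported claims the judge wrongly approves.
    A lenient judge, or one sharing the generator's biases, shows up here
    as a nonzero rate, which would undercut a reported 100% figure."""
    approvals = [judge(claim, evidence) for claim in unsupported_claims]
    return sum(approvals) / len(unsupported_claims)

# Usage sketch: probe claims deliberately absent from (or contradicting)
# the evidence. Both the evidence and the probes are invented examples.
evidence = ["BioBERT was pretrained on PubMed abstracts and PMC articles."]
probes = [
    "BioBERT was pretrained exclusively on clinical trial registries.",
    "BioBERT contains 175 billion parameters.",
]
lenient_judge = lambda claim, ev: True  # worst case: approves everything
print(false_positive_rate(lenient_judge, evidence, probes))  # -> 1.0
```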
Original abstract
Retrieval-augmented generation (RAG) improves large language model reliability by grounding generated responses in external evidence. However, RAG performance depends on the relevance of retrieved passages, the quality of evidence ranking, and the ability to verify whether generated claims are supported by source documents. This study presents a hybrid retrieval and reranking framework for citation-aware RAG in biomedical and healthcare-related document question answering. The framework uses Amazon Bedrock Knowledge Bases for document ingestion, parsing, chunking, embedding generation, and evidence retrieval. Source PDF documents are stored in Amazon S3, embedded using Amazon Titan Text Embeddings V2, and indexed with Amazon OpenSearch Serverless. Hybrid retrieval first retrieves candidate evidence chunks, and Cohere reranking then prioritizes the most relevant passages before answer generation. The answer-generation stage uses top-ranked evidence chunks to produce controlled, evidence-grounded responses, while a separate judge model evaluates each generated factual claim against the retrieved evidence. The framework was evaluated using 25 biomedical NLP and healthcare transformer queries as a pilot-scale proof-of-concept study. Across the evaluation set, the system retrieved and reranked 500 evidence chunks and generated answers from top-ranked evidence. Claim-level grounding evaluation extracted 200 factual claims, all of which were judged to be supported by retrieved evidence, resulting in 100.0% grounding accuracy. The results suggest that hybrid retrieval, reranking, conservative prompting, and claim-level evaluation can support reliable evidence-grounded RAG responses when sufficient source evidence is available.
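For orientation, the retrieval and reranking stages named in the abstract could be driven roughly as follows. This is a hedged sketch assuming the Bedrock Agent Runtime retrieve API and the Cohere Python SDK's rerank endpoint; the knowledge-base ID, API key, model version, and pool sizes are placeholders, and the paper does not specify how the Cohere reranker is invoked, so the SDK call here is just one plausible route:

```python
# Hedged sketch of the retrieval + reranking stages; IDs, keys, model
# version, and pool sizes below are placeholders, not the paper's values.
import boto3
import cohere

kb_client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
co = cohere.Client("COHERE_API_KEY")  # placeholder credential

query = "How are transformer models applied to clinical notes?"

# Stage 1: hybrid (lexical + vector) retrieval from the knowledge base.
resp = kb_client.retrieve(
    knowledgeBaseId="YOUR_KB_ID",  # placeholder
    retrievalQuery={"text": query},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 20,           # candidate pool size (assumed)
            "overrideSearchType": "HYBRID",  # combine lexical and vector scores
        }
    },
)
candidates = [r["content"]["text"] for r in resp["retrievalResults"]]

# Stage 2: rerank the candidates and keep the top few for generation.
reranked = co.rerank(
    model="rerank-english-v3.0",  # assumed model version
    query=query,
    documents=candidates,
    top_n=5,
)
evidence = [candidates[r.index] for r in reranked.results]
```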
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a hybrid retrieval and reranking framework for evidence-grounded RAG in biomedical and healthcare document QA. It integrates Amazon Bedrock Knowledge Bases for ingestion/parsing/embedding with Titan V2 and OpenSearch, applies hybrid retrieval followed by Cohere reranking, generates answers from top evidence, and uses a separate judge model to verify each generated factual claim. In a pilot evaluation on 25 queries, the system processed 500 chunks and reported 100% grounding accuracy across 200 extracted claims.
Significance. If the grounding result is substantiated, the framework illustrates a practical, tool-orchestrated pipeline that combines hybrid search, reranking, and claim-level verification to support reliable citation-aware generation in a high-stakes domain. Such an approach could serve as a template for production RAG systems where evidence sufficiency can be assured.
Major comments (2)
- [Abstract] Abstract and evaluation description: the central claim of 100.0% grounding accuracy rests entirely on judgments from an automated judge model whose identity, prompt, temperature, decision criteria, and validation (human agreement, error analysis, or false-positive checks) are not described. Because the generator and judge are both LLMs operating on the same retrieved chunks, systematic leniency or shared hallucination patterns cannot be ruled out.
- [Evaluation] Evaluation: the result is derived from only 25 queries and 200 claims with no information on query selection, claim extraction process, or whether the sample is representative. No baseline comparisons (e.g., retrieval-only or reranking-only) or out-of-distribution tests are reported, so the contribution of each pipeline stage to the accuracy figure cannot be isolated.
Minor comments (2)
- The manuscript would benefit from explicit discussion of related hybrid-retrieval and claim-verification literature to situate the engineering choices.
- Clarify the exact number of top-ranked chunks passed to generation and any hyperparameters of the reranker or conservative prompt.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our pilot study. We address each major comment below and will revise the manuscript to improve transparency on the judge model and evaluation details.
Point-by-point responses
- Referee: [Abstract] Abstract and evaluation description: the central claim of 100.0% grounding accuracy rests entirely on judgments from an automated judge model whose identity, prompt, temperature, decision criteria, and validation (human agreement, error analysis, or false-positive checks) are not described. Because the generator and judge are both LLMs operating on the same retrieved chunks, systematic leniency or shared hallucination patterns cannot be ruled out.
  Authors: We agree the judge model was under-described. In revision we will specify the exact LLM (via Amazon Bedrock), the full prompt template for claim-level verification, the temperature (set to 0), and the binary decision criteria with a required explanation; a hedged sketch of such a verification call follows these responses. We will also add an explicit limitations paragraph noting that, while the judge prompt is distinct and evidence-focused, shared LLM biases cannot be fully excluded without human validation, which was outside the pilot's scope. revision: yes
- Referee: [Evaluation] Evaluation: the result is derived from only 25 queries and 200 claims with no information on query selection, claim extraction process, or whether the sample is representative. No baseline comparisons (e.g., retrieval-only or reranking-only) or out-of-distribution tests are reported, so the contribution of each pipeline stage to the accuracy figure cannot be isolated.
  Authors: The work is presented as a pilot proof-of-concept, which accounts for the scale. We will expand the evaluation section with the query selection criteria (biomedical NLP and healthcare transformer topics), the claim extraction procedure (LLM-based factual statement identification from generated answers), and a statement that the 25-query set is illustrative rather than representative. Component baselines and OOD tests are not included because the focus is end-to-end grounding of the integrated pipeline; we will add a discussion of each stage's role and flag isolation experiments as future work. revision: partial
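As promised in the first response, here is a hedged sketch of what such claim-level verification could look like, assuming the Bedrock Converse API at temperature 0 with a binary SUPPORTED/UNSUPPORTED protocol and a required one-sentence justification; the judge model ID and prompt wording are placeholders, not the authors':

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder, not the paper's judge

def judge_claim(claim: str, evidence: list[str]) -> bool:
    """Binary supported/unsupported verdict at temperature 0, with a
    required one-sentence justification, per the rebuttal's description."""
    prompt = (
        "Evidence:\n" + "\n---\n".join(evidence)
        + f"\n\nClaim: {claim}\n\n"
        "Is the claim fully supported by the evidence above? "
        "Reply SUPPORTED or UNSUPPORTED, then one sentence of justification."
    )
    resp = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 200},
    )
    verdict = resp["output"]["message"]["content"][0]["text"].strip().upper()
    return verdict.startswith("SUPPORTED")

def grounding_accuracy(claims: list[str], evidence: list[str]) -> float:
    # The paper's headline metric: supported claims / extracted claims.
    return sum(judge_claim(c, evidence) for c in claims) / len(claims)
```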
Circularity Check
No circularity: purely empirical pilot evaluation
Full rationale
The paper describes a hybrid retrieval-reranking RAG pipeline and reports an empirical result (100% of 200 extracted claims judged supported) from a 25-query pilot. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The grounding accuracy is a direct measurement on the described system rather than a prediction or theorem that reduces to its own inputs by construction. The judge model is external to the core pipeline and its limitations are a validity concern, not a circularity issue.
Axiom & Free-Parameter Ledger
Empty: the pilot reports a direct measurement on the deployed system, with no stated axioms, fitted free parameters, or derived quantities to ledger.
Reference graph
Works this paper leans on
- [1] A. Johnson, L. Bulgarelli, T. Pollard, B. Gow, B. Moody, S. Horng, L. A. Celi, and R. Mark, "MIMIC-IV (version 3.1)," PhysioNet, RRID:SCR_007345. Available: https://doi.org/10.13026/kpb9-mt58
- [3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474. Available: https://arxiv.org/abs/2005.11401
- [4] Amazon Web Services, "Retrieve data and generate AI responses with knowledge bases," Amazon Bedrock User Guide, 2026. Available: https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
- [5] OpenSearch, "Hybrid search," OpenSearch Documentation, 2026. Available: https://docs.opensearch.org/latest/vector-search/ai-search/hybrid-search/index/
- [6] Cohere, "Cohere's rerank model," Cohere Documentation, 2026. Available: https://docs.cohere.com/docs/rerank
- [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017. Available: https://papers.nips.cc/paper/7181-attention-is-all-you-need
- [8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186. Available: https://aclanthology.org/N19-1423/
- [9] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, "REALM: Retrieval-augmented language model pre-training," in Proc. 37th International Conference on Machine Learning, 2020, pp. 3929–3938. Available: https://proceedings.mlr.press/v119/guu20a.html
- [10] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih, "Dense passage retrieval for open-domain question answering," in Proc. 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 6769–6781. Available: https://aclanthology.org/2020.emnlp-main.550/
- [11] G. Izacard and E. Grave, "Leveraging passage retrieval with generative models for open domain question answering," in Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021, pp. 874–880. Available: https://aclanthology.org/2021.eacl-main.74/
- [12] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. van den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. Rae, E. Elsen, and L. Sifre, "Improving language models by retrieving from trillions of tokens," in Proc. 39th International Conference on Machine Learning, 2022.
- [13] S. Robertson and H. Zaragoza, "The probabilistic relevance framework: BM25 and beyond," Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009. doi: 10.1561/1500000019. Available: https://dl.acm.org/doi/abs/10.1561/1500000019
- [14] O. Khattab and M. Zaharia, "ColBERT: Efficient and effective passage search via contextualized late interaction over BERT," in Proc. 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48. doi: 10.1145/3397271.3401075. Available: https://dl.acm.org/doi/10.1145/3397271.3401075
- [15] K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia, "ColBERTv2: Effective and efficient retrieval via lightweight late interaction," in Proc. 2022 Conference of the North American Chapter of the Association for Computational Linguistics, 2022, pp. 3715–3734. Available: https://aclanthology.org/2022.naacl-main.272/
- [16] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych, "BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models," in Proc. Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. Available: https://openreview.net/forum?id=wCu6T5xFjeJ
- [17] T. Formal, B. Piwowarski, and S. Clinchant, "SPLADE: Sparse lexical and expansion model for first stage ranking," in Proc. 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. Available: https://dl.acm.org/doi/10.1145/3404835.3463098
- [18] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave, "Unsupervised dense information retrieval with contrastive learning," Transactions on Machine Learning Research, 2022. Available: https://openreview.net/forum?id=jKN1pXi7b0
- [19] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang, "MS MARCO: A human generated machine reading comprehension dataset," Microsoft Research, 2016. Available: https://www.microsoft.com/en-us/research/publication/ms-marco-human-gene…
- [20] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "BioBERT: A pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020. doi: 10.1093/bioinformatics/btz682. Available: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506
- [22] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, "Domain-specific language model pretraining for biomedical natural language processing," ACM Transactions on Computing for Healthcare, vol. 3, no. 1, pp. 1–23, 2021. doi: 10.1145/3458754. Available: https://dl.acm.org/doi/10.1145/3458754
- [23] K. Huang, J. Altosaar, and R. Ranganath, "ClinicalBERT: Modeling clinical notes and predicting hospital readmission," 2019. Available: https://arxiv.org/abs/1904.05342
- [24] X. Yang, A. Chen, N. PourNejatian, H. C. Shin, K. E. Smith, C. Parisien, C. Compas, C. Martin, A. B. Costa, M. G. Flores, Y. Zhang, T. Magoc, C. A. Harle, G. Lipori, D. A. Mitchell, W. R. Hogan, E. A. Shenkman, J. Bian, and Y. Wu, "A large language model for electronic health records," npj Digital Medicine, vol. 5, article 194, 2022. doi: 10.1038/s41746-…
- [25] C. Peng, X. Yang, A. Chen, K. E. Smith, N. PourNejatian, A. B. Costa, C. Martin, M. G. Flores, Y. Zhang, T. Magoc, G. Lipori, D. A. Mitchell, W. R. Hogan, E. A. Shenkman, J. Bian, and Y. Wu, "A study of generative large language model for medical research and healthcare," npj Digital Medicine, vol. 6, article 210, 2023. doi: 10.1038/s41746-023-00958-w.
- [26] A. E. W. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, S. Hao, B. Moody, B. Gow, L. A. Celi, and R. G. Mark, "MIMIC-IV, a freely accessible electronic health record dataset," Scientific Data, vol. 10, article 1, 2023. doi: 10.1038/s41597-022-01899-x. Available: https://www.nature.com/articles/s41597-022-01899-x
- [27] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artières, A.-C. Ngonga Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras, "An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition," BMC Bioinformatics, vol. 16, article 138, 2015.
- [28] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, "PubMedQA: A dataset for biomedical research question answering," in Proc. 2019 Conference on Empirical Methods in Natural Language Processing, 2019, pp. 2567–2577. Available: https://aclanthology.org/D19-1259/
- [29] A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting, "Large language models in medicine," Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023. doi: 10.1038/s41591-023-02448-8. Available: https://www.nature.com/articles/s41591-023-02448-8
- [30] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, A. Babiker, N. Schärli, A. Chowdhery, P. Mansfield, B. Demner-Fushman, F. Agüera y Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar, et al., "Large language models encode clinical knowledge," Nature, vol. 620, pp. 172–180, 2023.
- [31] J. W. Ayers, A. Poliak, M. Dredze, E. C. Leas, Z. Zhu, J. B. Kelley, D. J. Faix, A. M. Goodman, C. A. Longhurst, M. Hogarth, and D. M. Smith, "Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum," JAMA Internal Medicine, vol. 183, no. 6, pp. 589–596, 2023. doi: 10.1001/jamainternmed…
- [32] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, article 248, pp. 1–38, 2023. doi: 10.1145/3571730. Available: https://dl.acm.org/doi/10.1145/3571730
- [33] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal, "FEVER: A large-scale dataset for fact extraction and verification," in Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 2018, pp. 809–819. Available: https://aclanthology.org/N18-1074/
- [34] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, and H. Hajishirzi, "Fact or fiction: Verifying scientific claims," in Proc. 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 7534–7550. Available: https://aclanthology.org/2020.emnlp-main.609/
- [35] S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, "FActScore: Fine-grained atomic evaluation of factual precision in long form text generation," in Proc. 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 12076–12100. Available: https://aclanthology.org/2023.emnlp-…
- [36] T. Gao, H. Yen, J. Yu, and D. Chen, "Enabling large language models to generate text with citations," in Proc. 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 6465–6488. Available: https://aclanthology.org/2023.emnlp-main.398/
- [37] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, "RAGAs: Automated evaluation of retrieval augmented generation," in Proc. 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024, pp. 150–158. Available: https://aclanthology.org/2024.eacl-demo.16/
- [38] Amazon Web Services, "How Amazon Bedrock knowledge bases work," Amazon Bedrock User Guide, 2026. Available: https://docs.aws.amazon.com/bedrock/latest/userguide/kb-how-it-works.html
- [39] Amazon Web Services, "Working with vector search collections," Amazon OpenSearch Service Developer Guide, 2026. Available: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-vector-search.html
- [40] F. A. Irany and S. Akwafuo, "Selective fine-tuning of GPT architectures for parameter-efficient clinical text classification," arXiv preprint arXiv:2603.14183, 2026. Available: https://arxiv.org/abs/2603.14183
- [41] F. A. Irany and S. Akwafuo, "From generative modeling to clinical classification: A GPT-based architecture for EHR notes," arXiv preprint arXiv:2601.21955, 2026. Available: https://arxiv.org/abs/2601.21955
- [42] H. Ahmed, M. Anderson, and M. A. Haider, "Impact of AI on student performance in engineering technology courses," 2025.
- [43] Amazon Web Services, "Amazon Bedrock pricing," Amazon Web Services, 2026. Available: https://aws.amazon.com/bedrock/pricing/
- [44] Amazon Web Services, "Amazon OpenSearch Service pricing," Amazon Web Services, 2026. Available: https://aws.amazon.com/opensearch-service/pricing/