pith. machine review for the scientific record.

arxiv: 2604.17943 · v1 · submitted 2026-04-20 · 💻 cs.CL


Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents


Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords DoRA · RAG benchmarking · defense documents · synthetic QA · hallucination reduction · domain shift · retrieval-augmented generation · question answering

The pith

A model fine-tuned on the DoRA benchmark achieves up to 26% higher QA success and 47% lower hallucination rates on defense documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DoRA, a benchmark built from defense documents that generates synthetic but intent-conditioned questions paired with traceable evidence passages. It tests retrieval-augmented generation across five question types and 6.5K instances to check both answer quality and source attribution. General language models perform similarly on this data, yet fine-tuning one on the DoRA examples produces clear gains in task success and fewer fabricated responses. This matters because open-domain benchmarks often inflate scores due to pretraining overlap, so domain-specific tests are needed to catch failures when models face unfamiliar content. The setup allows regression testing that accounts for contamination when models shift to new domains.

Core claim

DoRA is a domain-grounded benchmark with 6.5K synthetic instances that pairs intent-conditioned QA with auditable evidence passages. In end-to-end evaluation with a fixed dense retriever, general-purpose language models perform similarly to each other. A model trained on DoRA data, however, yields up to 26% improvement in QA task success over the base Llama3.1-8B-Instruct while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.
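
To make the claimed protocol concrete, here is a minimal sketch of the end-to-end loop it implies: a fixed dense retriever feeds top-k passages to the model under test, and the two headline metrics are tallied per instance. The callables retrieve, generate, and judge_faithful are hypothetical stand-ins, not the paper's API, and the success criterion below is a placeholder.

    # Hedged sketch of end-to-end RAG evaluation with a fixed retriever.
    # All identifiers (retrieve, generate, judge_faithful) are hypothetical
    # stand-ins for components the paper does not expose at this level.
    def answer_matches(pred: str, gold: str) -> bool:
        # Placeholder success criterion; the paper's exact metric is not given here.
        return gold.strip().lower() in pred.strip().lower()

    def evaluate_rag(instances, retrieve, generate, judge_faithful, k=5):
        successes = hallucinations = 0
        for inst in instances:
            passages = retrieve(inst["question"], top_k=k)  # retriever held fixed
            answer = generate(inst["question"], passages)   # model under test
            successes += answer_matches(answer, inst["gold_answer"])
            hallucinations += not judge_faithful(answer, passages)
        n = len(instances)
        return {"task_success": successes / n,
                "hallucination_rate": hallucinations / n}

Holding the retriever fixed isolates the generator, which is what makes the base-vs-SFT comparison in the headline numbers interpretable.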

What carries the argument

The DoRA benchmark itself: synthetic, intent-conditioned QA pairs coupled with curated evidence passages for attribution verification, spanning five question types (find, explain, summarize, generate, provide).
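
As a concrete picture of what one instance might carry, here is a hedged schema sketch; the field names are illustrative assumptions chosen to mirror the described pairing, not the paper's released format.

    from dataclasses import dataclass

    # Illustrative schema for a DoRA-style instance. Field names are assumptions;
    # the paper's actual serialization is not shown on this page.
    VALID_INTENTS = {"find", "explain", "summarize", "generate", "provide"}

    @dataclass
    class DoRAInstance:
        question: str
        intent: str                      # one of VALID_INTENTS
        gold_answer: str
        evidence_passage_ids: list[str]  # auditable attribution targets

    def validate(inst: DoRAInstance) -> None:
        assert inst.intent in VALID_INTENTS
        assert inst.evidence_passage_ids, "every instance must be attributable"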

If this is right

  • General-purpose language models show comparable performance when evaluated end-to-end on DoRA with a fixed retriever.
  • Fine-tuning on DoRA data produces up to 26% gains in QA task success.
  • RAG faithfulness scores improve with a 47% drop in hallucination rate after DoRA training.
  • The benchmark enables contamination-aware regression testing when models encounter domain shift.
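
On the last bullet, "regression testing" in this sense reduces to a gate on the two headline metrics. A minimal sketch, with a tolerance chosen purely for illustration and no claim that the authors use this form:

    # Contamination-aware regression gate: re-run the benchmark after any model
    # or domain change and fail if either headline metric regresses beyond a
    # tolerance. The 2-point tolerance is illustrative, not from the paper.
    def regression_gate(before: dict, after: dict, tol: float = 0.02) -> bool:
        return (after["task_success"] >= before["task_success"] - tol
                and after["hallucination_rate"] <= before["hallucination_rate"] + tol)

A CI job could feed this the dictionaries returned by an evaluation loop like the one sketched under the core claim above.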

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Domain-specific synthetic benchmarks could be extended to other restricted fields such as legal or medical documents to test RAG reliability without large real-query collections.
  • The hallucination reduction indicates that training on traceable attribution examples may strengthen evidence adherence more broadly.
  • If the five question types cover most real defense inquiries, similar synthetic construction could lower the cost of building reliable domain tests.
  • Public benchmarks that ignore domain shift may systematically overestimate deployment readiness for specialized content.

Load-bearing premise

The synthetic intent-conditioned QA pairs and curated evidence passages faithfully represent the distribution and attribution challenges of real user queries on defense documents without introducing generation artifacts or selection bias.

What would settle it

Evaluate the DoRA-trained model on a held-out set of questions actually written by defense-document users. Finding no improvement in success rate or no reduction in hallucination there would show that the synthetic data fails to capture real performance; if the gains carry over, the load-bearing premise holds.

Figures

Figures reproduced from arXiv: 2604.17943 by Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Flora Salim, Oscar Leslie, Pantelis Elinas, Tom Marchant.

Figure 1. DoRA pipeline, from data preparation to grounded-styled QA generation, and downstream domain evaluation and adaptation (DoRA SFT).
Figure 2. Retriever performance across top-k on DoRA; the GTE retriever achieved overall better performance.
Figure 3. Our DoRA SFT model vs ICL baselines.
Figure 4. Prompt template used for generating QA pairs with In-Context Learning.
Figure 5. Prompt template used for judging the quality of generated questions and answers, conditioned on the …
original abstract

Open-domain RAG benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned QA (question answering) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose Language Models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to 26% improvement in QA task success, while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DoRA, a synthetic benchmark of 6.5K intent-conditioned QA pairs derived from defense documents and paired with auditable evidence passages across five question types. It reports that general-purpose LMs perform similarly on this benchmark with a fixed dense retriever, while a model fine-tuned on DoRA (DoRA SFT) achieves up to 26% higher QA task success and 47% lower hallucination rates in RAG faithfulness scores compared to the Llama3.1-8B-Instruct base model.

Significance. If the synthetic data is shown to faithfully represent real defense query distributions without generation artifacts, DoRA could provide a useful contamination-aware benchmark for domain-specific RAG evaluation and fine-tuning, addressing limitations of public-corpus benchmarks.

major comments (2)
  1. [Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.
  2. [Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.
minor comments (2)
  1. [Abstract] The abstract states specific percentage improvements without supplying evaluation protocol details, baseline comparisons, statistical tests, or error analysis.
  2. [Evaluation] Clarify the exact definitions and computation of 'QA task success' and 'RAG faithfulness scores' and whether a train/test split was used for the SFT evaluation.
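
On minor comment 2: the page does not supply the definitions, so the following is one plausible formulation only, not the authors' protocol. Task success is judged per instance; faithfulness is the fraction of answer claims supported by retrieved evidence, with the hallucination rate as its complement.

    # One plausible (not the paper's) reading of the two contested metrics.
    def qa_task_success(pred: str, gold: str) -> bool:
        # Binary per-instance success; a real protocol might use token-level F1
        # or an LLM judge instead of substring containment.
        return gold.strip().lower() in pred.strip().lower()

    def faithfulness(claims_supported: list[bool]) -> float:
        # Fraction of answer claims entailed by the retrieved passages;
        # hallucination rate = 1 - faithfulness under this reading.
        return sum(claims_supported) / max(len(claims_supported), 1)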

Simulated Authors' Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point-by-point below, agreeing with the concerns where valid and outlining specific revisions to strengthen the manuscript without overstating our claims.

point-by-point responses
  1. Referee: [Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.

    Authors: We acknowledge that the reported DoRA SFT results were computed on the full set of 6.5K synthetic instances used for fine-tuning, which limits direct evidence of generalization to held-out queries. To address this, we will revise the manuscript to include an explicit train/test split (e.g., 80/20) of the DoRA benchmark, with all headline metrics recomputed on the unseen test portion. The abstract and results sections will be updated accordingly, and claims about 'domain shift' will be qualified to refer specifically to performance gains on this synthetic benchmark for contamination-aware evaluation rather than broad generalization to operational user distributions. We note that real defense query logs remain inaccessible due to classification constraints. revision: yes

  2. Referee: [Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.

    Authors: We agree that additional validation metrics would improve the benchmark description. Due to the sensitive and classified nature of the source defense documents, real query logs are unavailable, precluding KL divergence or direct statistical matching to operational distributions. We will expand the benchmark construction section with: (1) explicit reporting of question-type balance across the five categories, (2) details on evidence passage curation for attribution, (3) basic statistical summaries (lengths, vocabulary overlap) and post-generation filtering steps to address artifact detection, and (4) a limitations paragraph noting the absence of expert fidelity ratings. These additions will be quantitative where possible within the constraints of the data. revision: partial
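
The fix proposed in response 1 is a standard held-out split. A minimal sketch, assuming each instance carries a stable ID; hashing keeps train/test membership deterministic across re-runs, so the test portion can never leak into SFT.

    import hashlib

    # Deterministic 80/20 split keyed on a stable instance ID. The ratio mirrors
    # the rebuttal's example; nothing here is the authors' actual protocol.
    def assign_split(instance_id: str, test_fraction: float = 0.2) -> str:
        digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
        u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        return "test" if u < test_fraction else "train"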

standing simulated objections not resolved
  • Quantitative comparison (e.g., KL divergence) to real defense query logs, as such logs are inaccessible due to classification and security restrictions.
  • Expert fidelity ratings on the synthetic QA pairs, as this would require domain-expert access to classified materials not available during the original study.
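
If even aggregate query statistics were ever releasable, the first standing objection could be partially addressed without exposing the logs themselves. A sketch of smoothed KL divergence over question-type frequencies; the reference counts below are hypothetical, invented purely to make the example runnable.

    import math
    from collections import Counter

    # KL(P || Q) over question-type distributions with add-one smoothing, where
    # P is the synthetic benchmark and Q a (hypothetical) real-query reference.
    def kl_divergence(p_counts: Counter, q_counts: Counter) -> float:
        types = set(p_counts) | set(q_counts)
        p_total = sum(p_counts.values()) + len(types)
        q_total = sum(q_counts.values()) + len(types)
        kl = 0.0
        for t in types:
            p = (p_counts[t] + 1) / p_total
            q = (q_counts[t] + 1) / q_total
            kl += p * math.log(p / q)
        return kl

    # Hypothetical example over the five DoRA intent types.
    synthetic = Counter(find=1300, explain=1300, summarize=1300, generate=1300, provide=1300)
    reference = Counter(find=2400, explain=1800, summarize=900, generate=600, provide=800)
    print(f"KL(synthetic || reference) = {kl_divergence(synthetic, reference):.4f}")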

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model comparisons

full rationale

The paper constructs a synthetic benchmark (DoRA) from defense documents and reports the empirical performance of several models, including a fine-tuned variant (DoRA SFT), against the base Llama-3.1-8B-Instruct. There is no mathematical derivation chain, no self-definitional step, no fitted parameter renamed as a prediction, and no load-bearing self-citation. The central claims are direct end-to-end QA and faithfulness measurements on the constructed instances, and the results are not reducible to the inputs by construction. This is a standard empirical benchmark paper whose claims are independent of the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on the abstract only; the full paper would likely add further assumptions about synthetic data quality and retriever choice.

axioms (1)
  • domain assumption: A fixed dense retriever is sufficient and representative for end-to-end RAG evaluation on defense documents
    Stated in the evaluation description.
invented entities (1)
  • DoRA benchmark (no independent evidence)
    purpose: Domain-specific synthetic QA test set with attribution
    Newly constructed collection of 6.5K instances

pith-pipeline@v0.9.0 · 5478 in / 1315 out tokens · 63866 ms · 2026-05-10T04:47:26.021886+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations , year =

    RAGAs: Automated Evaluation of Retrieval Augmented Generation , author =. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations , year =

  2. [2]

    Medical Graph RAG : Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation

    Wu, Junde and Zhu, Jiayuan and Qi, Yunli and Chen, Jingkun and Xu, Min and Menolascina, Filippo and Jin, Yueming and Grau, Vicente. Medical Graph RAG : Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. ...

  3. [3]

    Generating

    Filice, Simone and Horowitz, Guy and Carmel, David and Karnin, Zohar and Lewin-Eytan, Liane and Maarek, Yoelle , booktitle =. Generating

  4. [4]

    arXiv preprint arXiv:2505.14212 , year=

    Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks , author=. arXiv preprint arXiv:2505.14212 , year=

  5. [5]

    Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering , author =. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  6. [6]

    Laradji , booktitle=

    Amirhossein Abaskohi and Spandana Gella and Giuseppe Carenini and Issam H. Laradji , booktitle=. 2025 , url=

  7. [7]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =

    Synthetic Multimodal Question Generation , author =. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =. 2024 , address =

  8. [8]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  9. [9]

    Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track , year =

    RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation , author =. Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track , year =

  10. [10]

    Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen

    SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction , author=. arXiv preprint arXiv:2503.01478 , year=

  11. [11]

    Faitheval: Can your language model stay faithful to context, even if” the moon is made of marshmallows”.arXiv preprint arXiv:2410.03727,

    FaithEval: Can Your Language Model Stay Faithful to Context, Even If “The Moon is Made of Marshmallows” , author=. arXiv preprint arXiv:2410.03727 , year=

  12. [12]

    arXiv preprint arXiv:2409.03759v1 , year=

    VERA: Validation and Evaluation of Retrieval-Augmented systems , author=. arXiv preprint arXiv:2409.03759v1 , year=

  13. [13]

    Advances in Neural Information Processing Systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=

  14. [14]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao and Yun Xiong and Xinyu Gao and Kangxiang Jia and Jinliu Pan and Yuxi Bi and Yi Dai and Jiawei Sun and Qianyu Guo and Meng Wang and Haofen Wang , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2312.10997 , eprinttype =

  15. [15]

    KILT : a benchmark for knowledge intensive language tasks

    Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt\". Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua...

  16. [16]

    Transactions of the Association for Computational Linguistics , volume=

    Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=

  17. [17]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    SQuAD: 100,000+ questions for machine comprehension of text , author=. arXiv preprint arXiv:1606.05250 , year=

  18. [18]

    and Artzi, Yoav , journal=

    Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav , journal=. BERTScore: Evaluating Text Generation with

  19. [19]

    Sellam, Thibault and Das, Dipanjan and Parikh, Ankur , journal=

  20. [20]

    LoRA: Low-Rank Adaptation of Large Language Models

    LoRA: Low-Rank Adaptation of Large Language Models , author=. arXiv preprint arXiv:2106.09685 , year=

  21. [21]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=

  22. [22]

    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , author=. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  23. [23]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Benchmarking large language models in retrieval-augmented generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  24. [24]

    Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models

    CRUD-RAG: A comprehensive chinese benchmark for retrieval-augmented generation of large language models , author=. arXiv preprint arXiv:2401.17043 , year=

  25. [25]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages =

    ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems , author =. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages =. 2024 , address =

  26. [26]

    2024 , eprint=

    RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems , author=. 2024 , eprint=

  27. [27]

    MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries.arXiv:2401.15391, 2024

    Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries , author=. arXiv preprint arXiv:2401.15391 , year=

  28. [28]

    2021 , eprint=

    BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models , author=. 2021 , eprint=

  29. [29]

    2008 , publisher =

    Introduction to Information Retrieval , author =. 2008 , publisher =

  30. [30]

    and Zhang, Tianyi and Liang, Percy , title =

    Liu, Nelson F. and Zhang, Tianyi and Liang, Percy , title =. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages =. 2023 , doi =

  31. [31]

    ZeroGen: Efficient Zero-shot Learning via Dataset Generation , booktitle =

    ZeroGen: Efficient Zero-shot Learning via Dataset Generation , author =. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , month = dec, year =. doi:10.18653/v1/2022.emnlp-main.801 , pages =

  32. [32]

    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , month = may, year =

    Prompting-based Synthetic Data Generation for Few-Shot Question Answering , author =. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , month = may, year =

  33. [33]

    Text summarization branches out , pages=

    ROUGE: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  34. [34]

    Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

    BLEU: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

  35. [35]

    1997 , publisher =

    Dan Gusfield , title =. 1997 , publisher =

  36. [36]

    2024 , howpublished =

    Introducing Llama 3.1: Open Foundation Models at Scale , author =. 2024 , howpublished =

  37. [37]

    government for national security , author =

    Meta offers Llama AI models to U.S. government for national security , author =. CIO , year =

  38. [38]

    Foundations and Trends in Information Retrieval , volume =

    The Probabilistic Relevance Framework: BM25 and Beyond , author =. Foundations and Trends in Information Retrieval , volume =. 2009 , publisher =

  39. [39]

    2025 , howpublished =

    mGTE: Generalized Text Embedding Models for 75 Languages and 8k Context Length , author =. 2025 , howpublished =

  40. [40]

    2025 , howpublished =

    MiniCPM-Embedding: A Bilingual and Cross-Lingual Text Embedding Model , author =. 2025 , howpublished =

  41. [41]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation , author =. arXiv preprint arXiv:2402.03216 , year =

  42. [42]

    arXiv preprint arXiv:2508.21085 , year =

    Granite Embedding R2 Models , author =. arXiv preprint arXiv:2508.21085 , year =

  43. [43]

    CLEAN EVAL : Clean Evaluation on Contaminated Large Language Models

    Zhu, Yifeng and Liu, Yiqi and Yang, Jiashuo and Jia, Mengzhou and Wang, Minjie and Li, Chao and Li, Jia and Wong, Kam-Fai and Liu, Zitao. CLEAN EVAL : Clean Evaluation on Contaminated Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024

  44. [44]

    An Open-Source Data Contamination Report for Large Language Models

    Li, Yixiao and van Leeuwen, Mathijs and Drevon, Gabriel and van der Wees, Marlies and Tang, Yixuan and Bulian, Jannis and Di Gangi, Mattia and Cahyawijaya, Samuel and Laskar, Md Tahmid Rahman and Pan, Jianguo and Zhang, Wenyi and Michel, Paul and Neubig, Graham and Weller, Orion. An Open-Source Data Contamination Report for Large Language Models. Findings...

  45. [45]

    Extracting Training Data from Large Language Models , booktitle =

    Carlini, Nicholas and Tramer, Florian and Wallace, Eric and Jagielski, Matthew and Herbert-Voss, Ariel and Lee, Katherine and Roberts, Adam and Brown, Tom and Song, Dawn and Erlingsson,. Extracting Training Data from Large Language Models , booktitle =. 2021 , address =

  46. [46]

    and Marques, Jo \ a o DS and Gra c a, Miguel and Freire, Miguel and Li, Lei and Oliveira, Arlindo L

    Duarte, Andr \'e V. and Marques, Jo \ a o DS and Gra c a, Miguel and Freire, Miguel and Li, Lei and Oliveira, Arlindo L. L umber C hunker: Long-Form Narrative Document Segmentation. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.377

  47. [47]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Reimers, Nils and Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019