Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents
Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3
The pith
A model fine-tuned on the DoRA benchmark achieves up to 26% higher QA success and 47% lower hallucination rates on defense documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DoRA is a domain-grounded benchmark with 6.5K synthetic instances that pairs intent-conditioned QA with auditable evidence passages. In end-to-end evaluation with a fixed dense retriever, general-purpose language models perform similarly to each other. A model trained on DoRA data, however, yields up to 26% improvement in QA task success over the base Llama3.1-8B-Instruct while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.
What carries the argument
The DoRA benchmark of synthetic intent-conditioned QA pairs paired with curated evidence passages for attribution verification, covering five question types: find, explain, summarize, generate, provide.
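As a concrete illustration of what one such instance might look like, here is a minimal sketch in Python. The field names and the example content are assumptions for illustration, not the paper's actual schema; only the five intent categories come from the abstract.

```python
from dataclasses import dataclass

# The five intent categories named in the paper.
QUESTION_TYPES = {"find", "explain", "summarize", "generate", "provide"}

@dataclass
class DoraInstance:
    """Hypothetical shape of one DoRA-style instance (assumed, not official)."""
    question: str            # synthetic, intent-conditioned question
    question_type: str       # one of the five intent categories
    answer: str              # reference answer
    evidence_passages: list  # curated passages that make the answer auditable

    def __post_init__(self):
        if self.question_type not in QUESTION_TYPES:
            raise ValueError(f"unknown question type: {self.question_type}")

# Invented example content, purely for illustration:
ex = DoraInstance(
    question="Which directive governs equipment maintenance intervals?",
    question_type="find",
    answer="Directive X establishes the maintenance intervals.",
    evidence_passages=["Directive X, Section 3: maintenance intervals ..."],
)
```

Pairing each answer with explicit evidence passages is what makes attribution auditable: a faithfulness checker can score the answer against `evidence_passages` rather than against the whole corpus.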
If this is right
- General-purpose language models show comparable performance when evaluated end-to-end on DoRA with a fixed retriever.
- Fine-tuning on DoRA data produces up to 26% gains in QA task success.
- RAG faithfulness scores improve with a 47% drop in hallucination rate after DoRA training.
- The benchmark enables contamination-aware regression testing when models encounter domain shift.
Where Pith is reading between the lines
- Domain-specific synthetic benchmarks could be extended to other restricted fields such as legal or medical documents to test RAG reliability without large real-query collections.
- The hallucination reduction indicates that training on traceable attribution examples may strengthen evidence adherence more broadly.
- If the five question types cover most real defense inquiries, similar synthetic construction could lower the cost of building reliable domain tests.
- Public benchmarks that ignore domain shift may systematically overestimate deployment readiness for specialized content.
Load-bearing premise
The synthetic intent-conditioned QA pairs and curated evidence passages faithfully represent the distribution and attribution challenges of real user queries on defense documents without introducing generation artifacts or selection bias.
What would settle it
Evaluating the DoRA-trained model on a held-out set of actual human-generated questions from defense document users and finding no improvement in success rate or no reduction in hallucination would show that the synthetic data fails to capture real performance.
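That falsification test reduces to a paired comparison of success rates on synthetic versus human-written held-out questions. A minimal sketch, with all outcome judgments invented as placeholders:

```python
# Sketch of the proposed falsification test: compare task-success rates of the
# DoRA-trained model on synthetic test items vs. held-out human-written
# questions. The 0/1 judgments below are invented placeholders.

def success_rate(outcomes):
    """Fraction of questions judged successful (1) vs. failed (0)."""
    return sum(outcomes) / len(outcomes)

synthetic_outcomes = [1, 1, 0, 1, 1, 1, 0, 1]   # placeholder judgments
human_outcomes     = [1, 0, 0, 1, 0, 1, 0, 0]   # placeholder judgments

gap = success_rate(synthetic_outcomes) - success_rate(human_outcomes)
# A large positive gap would suggest the synthetic data fails to capture
# real query difficulty, per the criterion above.
print(f"synthetic - human success gap: {gap:.2f}")
```

In practice the gap would need a significance test (e.g. a bootstrap over questions) before concluding anything from it.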
Original abstract
Open-domain RAG benchmarks over public corpora can overestimate deployment performance due to pretraining overlap and weak attribution requirements. We present DoRA (Domain-oriented RAG Assessment), a domain-grounded benchmark built from defense documents that pairs synthetic, intent-conditioned QA (question answering) with auditable evidence passages for attribution. DoRA covers five question types (find, explain, summarize, generate, provide) and contains 6.5K curated instances. In end-to-end evaluation with a fixed dense retriever, general-purpose Language Models (LMs) perform similarly, while a model trained on DoRA (DoRA SFT) yields large gains over the base model (Llama3.1-8B-Instruct): up to 26% improvement in QA task success, while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DoRA, a synthetic benchmark of 6.5K intent-conditioned QA pairs derived from defense documents and paired with auditable evidence passages across five question types. It reports that general-purpose LMs perform similarly on this benchmark with a fixed dense retriever, while a model fine-tuned on DoRA (DoRA SFT) achieves up to 26% higher QA task success and 47% lower hallucination rates in RAG faithfulness scores compared to the Llama3.1-8B-Instruct base model.
Significance. If the synthetic data is shown to faithfully represent real defense query distributions without generation artifacts, DoRA could provide a useful contamination-aware benchmark for domain-specific RAG evaluation and fine-tuning, addressing limitations of public-corpus benchmarks.
Major comments (2)
- [Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.
- [Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.
Minor comments (2)
- [Abstract] The abstract states specific percentage improvements without supplying evaluation protocol details, baseline comparisons, statistical tests, or error analysis.
- [Evaluation] Clarify the exact definitions and computation of 'QA task success' and 'RAG faithfulness scores' and whether a train/test split was used for the SFT evaluation.
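Since the abstract leaves 'QA task success' and 'RAG faithfulness scores' undefined, here is one plausible operationalization, offered purely as an assumption to make the request concrete: success as normalized exact match, and hallucination rate as the fraction of answer sentences without support in the retrieved evidence (real faithfulness scorers such as RAGAs use an LLM judge rather than this token-overlap proxy).

```python
# Assumed metric definitions for illustration only; the paper may define
# 'QA task success' and 'faithfulness' differently.

def qa_task_success(prediction: str, reference: str) -> bool:
    """Normalized exact match -- one common (assumed) notion of task success."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(reference)

def hallucination_rate(answer_sentences, evidence: str) -> float:
    """Fraction of answer sentences with no lexical support in the evidence.
    A crude proxy: a sentence counts as supported if more than half of its
    tokens appear in the evidence passage."""
    evidence_tokens = set(evidence.lower().split())
    def supported(sent):
        tokens = set(sent.lower().split())
        return len(tokens & evidence_tokens) / max(len(tokens), 1) > 0.5
    unsupported = [s for s in answer_sentences if not supported(s)]
    return len(unsupported) / max(len(answer_sentences), 1)
```

Whatever the paper's actual definitions are, the referee's point stands: they must be stated precisely enough that the 26% and 47% figures are reproducible.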
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point-by-point below, agreeing with the concerns where valid and outlining specific revisions to strengthen the manuscript without overstating our claims.
Point-by-point responses
Referee: [Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.
Authors: We acknowledge that the reported DoRA SFT results were computed on the full set of 6.5K synthetic instances used for fine-tuning, which limits direct evidence of generalization to held-out queries. To address this, we will revise the manuscript to include an explicit train/test split (e.g., 80/20) of the DoRA benchmark, with all headline metrics recomputed on the unseen test portion. The abstract and results sections will be updated accordingly, and claims about 'domain shift' will be qualified to refer specifically to performance gains on this synthetic benchmark for contamination-aware evaluation rather than broad generalization to operational user distributions. We note that real defense query logs remain inaccessible due to classification constraints. revision: yes
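The 80/20 split the rebuttal proposes can be sketched in a few lines; a fixed seed keeps the held-out test portion reproducible. The instance IDs here are placeholders standing in for the 6.5K DoRA instances.

```python
import random

# Sketch of the proposed 80/20 train/test split (assumed protocol, not the
# paper's). A fixed seed makes the held-out portion reproducible.

def split_benchmark(instance_ids, test_frac=0.2, seed=13):
    ids = list(instance_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - test_frac))
    return ids[:cut], ids[cut:]   # (train for SFT, held-out test for metrics)

train_ids, test_ids = split_benchmark(range(6500))
assert not set(train_ids) & set(test_ids)   # no leakage between SFT and eval
```

With 6.5K instances this yields 5,200 training and 1,300 held-out items; all headline metrics would then be recomputed on the unseen 1,300.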
Referee: [Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.
Authors: We agree that additional validation metrics would improve the benchmark description. Due to the sensitive and classified nature of the source defense documents, real query logs are unavailable, precluding KL divergence or direct statistical matching to operational distributions. We will expand the benchmark construction section with: (1) explicit reporting of question-type balance across the five categories, (2) details on evidence passage curation for attribution, (3) basic statistical summaries (lengths, vocabulary overlap) and post-generation filtering steps to address artifact detection, and (4) a limitations paragraph noting the absence of expert fidelity ratings. These additions will be quantitative where possible within the constraints of the data. revision: partial
The authors state that the following cannot be provided:
- Quantitative comparison (e.g., KL divergence) to real defense query logs, as such logs are inaccessible due to classification and security restrictions.
- Expert fidelity ratings on the synthetic QA pairs, as this would require domain-expert access to classified materials not available during the original study.
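Even without real query logs, the question-type balance check promised in the rebuttal is straightforward. A minimal sketch, where KL divergence is computed against a uniform reference distribution because no empirical reference is available (that choice is an assumption, not the paper's):

```python
import math
from collections import Counter

TYPES = ["find", "explain", "summarize", "generate", "provide"]

def type_distribution(labels):
    """Empirical distribution over the five question types."""
    counts = Counter(labels)
    total = len(labels)
    return {t: counts.get(t, 0) / total for t in TYPES}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with smoothing to avoid log(0)."""
    return sum(p[t] * math.log((p[t] + eps) / (q[t] + eps)) for t in TYPES)

labels = TYPES * 20                          # perfectly balanced toy set
p = type_distribution(labels)
uniform = {t: 1 / len(TYPES) for t in TYPES}
print(f"KL to uniform: {kl_divergence(p, uniform):.4f}")   # prints 0.0000
```

A nonzero value would quantify how far the benchmark's type mix departs from the chosen reference; the harder artifact-detection and fidelity checks the referee asks for cannot be reduced to such a statistic.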
Circularity Check
No circularity: empirical benchmark with direct model comparisons
Full rationale
The paper constructs a synthetic benchmark (DoRA) from defense documents and reports empirical performance of models including a fine-tuned variant (DoRA SFT) versus the base Llama-3.1-8B-Instruct. No mathematical derivation chain, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations exist. Central claims are direct end-to-end QA and faithfulness metrics on the constructed instances, without any reduction of results to inputs by construction. This is a standard empirical benchmark paper whose claims remain independent of the listed circularity patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: a fixed dense retriever is sufficient and representative for end-to-end RAG evaluation on defense documents.
Invented entities (1)
- DoRA benchmark: no independent evidence