A Benchmark Construction and Evaluation Framework for Specialist Domains: Case Study on Defense-related Documents

Aarya Bodhankar; Aditya Joshi; Bao Gia Doan; Flora Salim; Oscar Leslie; Pantelis Elinas; Tom Marchant

arxiv: 2604.17943 · v2 · pith:O7ERIZCRnew · submitted 2026-04-20 · 💻 cs.CL

Bao Gia Doan, Aditya Joshi, Pantelis Elinas, Aarya Bodhankar, Oscar Leslie, Tom Marchant, Flora Salim

Pith reviewed 2026-05-10 04:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords DoRA · RAG benchmarking · defense documents · synthetic QA · hallucination reduction · domain shift · retrieval-augmented generation · question answering

The pith

A model fine-tuned on the DoRA benchmark achieves up to 26% higher QA success and 47% lower hallucination rates on defense documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DoRA, a benchmark built from defense documents that generates synthetic but intent-conditioned questions paired with traceable evidence passages. It tests retrieval-augmented generation across five question types and 6.5K instances to check both answer quality and source attribution. General-purpose language models perform similarly to one another on this data, yet fine-tuning one on the DoRA examples produces clear gains in task success and fewer fabricated responses. This matters because open-domain benchmarks often inflate scores due to pretraining overlap, so domain-specific tests are needed to catch failures when models face unfamiliar content. The setup allows regression testing that accounts for contamination when models shift to new domains.

Core claim

DoRA is a domain-grounded benchmark with 6.5K synthetic instances that pairs intent-conditioned QA with auditable evidence passages. In end-to-end evaluation with a fixed dense retriever, general-purpose language models perform similarly to each other. A model trained on DoRA data, however, yields up to 26% improvement in QA task success over the base Llama3.1-8B-Instruct while reducing the hallucination rate by 47% in RAG faithfulness scores, supporting contamination-aware regression testing under domain shift.

What carries the argument

The DoRA benchmark: synthetic intent-conditioned QA pairs matched with curated evidence passages for attribution verification, covering five question types (find, explain, summarize, generate, provide).

If this is right

General-purpose language models show comparable performance when evaluated end-to-end on DoRA with a fixed retriever.
Fine-tuning on DoRA data produces up to 26% gains in QA task success.
RAG faithfulness scores improve with a 47% drop in hallucination rate after DoRA training.
The benchmark enables contamination-aware regression testing when models encounter domain shift.
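
The percentages above do not say whether they are relative or absolute changes, and the two readings can differ substantially. The sketch below shows both calculations; all scores in it are hypothetical values chosen for illustration and are not taken from the paper.

```python
def absolute_change(base: float, new: float) -> float:
    """Difference between two rates, as a fraction (0.13 means 13 points)."""
    return new - base


def relative_change(base: float, new: float) -> float:
    """Change of `new` relative to `base`, as a fraction of `base`."""
    if base == 0:
        raise ValueError("base must be non-zero for a relative change")
    return (new - base) / base


# Hypothetical scores, for illustration only (NOT results from the paper).
base_success, tuned_success = 0.50, 0.63
base_halluc, tuned_halluc = 0.30, 0.159

success_abs = absolute_change(base_success, tuned_success)
success_rel = relative_change(base_success, tuned_success)
halluc_abs = absolute_change(base_halluc, tuned_halluc)
halluc_rel = relative_change(base_halluc, tuned_halluc)

print(f"QA success:    {success_abs * 100:+.1f} points absolute, {success_rel:+.0%} relative")
print(f"Hallucination: {halluc_abs * 100:+.1f} points absolute, {halluc_rel:+.0%} relative")
```

With these made-up inputs, a 26% relative gain corresponds to a much smaller absolute gain in points, which is why a reader may want the reporting convention stated explicitly.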

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Domain-specific synthetic benchmarks could be extended to other restricted fields such as legal or medical documents to test RAG reliability without large real-query collections.
The hallucination reduction indicates that training on traceable attribution examples may strengthen evidence adherence more broadly.
If the five question types cover most real defense inquiries, similar synthetic construction could lower the cost of building reliable domain tests.
Public benchmarks that ignore domain shift may systematically overestimate deployment readiness for specialized content.

Load-bearing premise

The synthetic intent-conditioned QA pairs and curated evidence passages faithfully represent the distribution and attribution challenges of real user queries on defense documents without introducing generation artifacts or selection bias.

What would settle it

Evaluating the DoRA-trained model on a held-out set of actual human-generated questions from defense document users and finding no improvement in success rate or no reduction in hallucination would show that the synthetic data fails to capture real performance.
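
One possible way to run such a check is a paired bootstrap over per-question outcomes on the held-out human-written set. The sketch below is an assumption about how that could be done, not the paper's protocol: it assumes each question has a binary success label (1 or 0) for both the base and the DoRA-trained model, and the function name and parameters are illustrative.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    base: Sequence[int],
    tuned: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, CI lower, CI upper) for tuned minus base.

    `base` and `tuned` hold per-question outcomes (1 = success, 0 = failure)
    for the same questions in the same order.
    """
    if len(base) != len(tuned) or len(base) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(base)
    deltas = [t - b for b, t in zip(base, tuned)]
    observed = sum(deltas) / n

    # Resample questions with replacement and recompute the mean difference.
    stats = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()

    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the resulting interval includes zero on real human-written questions, that would be consistent with the "no improvement" outcome described above.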

Figures

Figures reproduced from arXiv: 2604.17943 by Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Flora Salim, Oscar Leslie, Pantelis Elinas, Tom Marchant.

**Figure 1.** DoRA pipeline, from data preparation to grounded-styled QA generation, and downstream domain evaluation and adaptation.

**Figure 2.** Retriever performance across top-k on DoRA. GTE retriever achieved overall better performance.

**Figure 3.** Our DoRA SFT model vs ICL baselines.

**Figure 4.** Prompt template used for generating QA pairs with In-Context Learning.

**Figure 5.** Prompt template used for judging the quality of generated question and answers conditioned on the …

Original abstract

RAG-based question-answering (QA) in specialist domains faces a cold-start problem: lack of evaluative benchmarks and absence of labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a novel benchmark construction and evaluation framework using only a small set of specialist domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits use different LLM families (Claude Sonnet for training; GPT-4o for test) drawn from disjoint seed-document corpora. Instantiated on 40 defense-related documents (written in English), DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.1-8B trained on the synthetic training set consistently improves performance over 6 coverage and faithfulness metrics, especially reducing hallucination by more than half under the default GTE retrieval setting, with gains persisting across alternative retrievers and prompting-based baselines. Defense-domain expertise is incorporated in three stages of our evaluation: (a) determining the quality of the synthetic QA generated by DoRA, (b) ascertaining the reliability of LLM-as-judge scores, and (c) evaluating the generalization of the QA pipeline on completely human-written QA examples. We position DoRA as a practical framework for specialist-domain RAG under domain shift, with defense as a high-stakes case study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DoRA builds a synthetic benchmark for RAG on defense documents and reports solid gains from fine-tuning on it, but those gains sit on generated data with no clear check against real query distributions.

The paper introduces DoRA, a 6.5K-instance synthetic benchmark built from defense documents. It generates intent-conditioned questions across five types (find, explain, summarize, generate, provide) and pairs them with curated evidence passages for attribution testing. A model fine-tuned on DoRA shows up to 26% better QA success and 47% lower hallucination rates than the base Llama3.1-8B-Instruct when both are run with the same retriever on the benchmark itself.

Referee Report

2 major / 2 minor

Summary. The paper introduces DoRA, a synthetic benchmark of 6.5K intent-conditioned QA pairs derived from defense documents and paired with auditable evidence passages across five question types. It reports that general-purpose LMs perform similarly on this benchmark with a fixed dense retriever, while a model fine-tuned on DoRA (DoRA SFT) achieves up to 26% higher QA task success and 47% lower hallucination rates in RAG faithfulness scores compared to the Llama3.1-8B-Instruct base model.

Significance. If the synthetic data is shown to faithfully represent real defense query distributions without generation artifacts, DoRA could provide a useful contamination-aware benchmark for domain-specific RAG evaluation and fine-tuning, addressing limitations of public-corpus benchmarks.

major comments (2)

[Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.
[Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.
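
As an illustration of the kind of check the second comment asks for, the sketch below computes a KL divergence between the question-type distribution of a synthetic set and a reference distribution. The reference distribution, the add-one smoothing, and the function names are assumptions made for this example; the paper does not describe such a computation.

```python
import math
from collections import Counter
from typing import Dict, Iterable

INTENTS = ("find", "explain", "summarize", "generate", "provide")


def intent_distribution(labels: Iterable[str], smoothing: float = 1.0) -> Dict[str, float]:
    """Smoothed relative frequency of each of the five question types."""
    counts = Counter(labels)
    unknown = set(counts) - set(INTENTS)
    if unknown:
        raise ValueError(f"unknown question types: {sorted(unknown)}")
    total = sum(counts.values()) + smoothing * len(INTENTS)
    return {i: (counts.get(i, 0) + smoothing) / total for i in INTENTS}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; assumes q is strictly positive on every intent."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in INTENTS if p[i] > 0)


# Hypothetical labels; a real check would use the benchmark's own annotations
# and a reference drawn from real queries or expert estimates.
synthetic_labels = ["find"] * 40 + ["explain"] * 25 + ["summarize"] * 15 + ["generate"] * 10 + ["provide"] * 10
reference = {i: 1 / len(INTENTS) for i in INTENTS}  # assumed uniform reference

synthetic = intent_distribution(synthetic_labels)
print(f"KL(synthetic || reference) = {kl_divergence(synthetic, reference):.4f} nats")
```

A value near zero would indicate that the synthetic question-type mix is close to the reference; it says nothing about attribution difficulty, which would need a separate measure.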

minor comments (2)

[Abstract] The abstract states specific percentage improvements without supplying evaluation protocol details, baseline comparisons, statistical tests, or error analysis.
[Evaluation] Clarify the exact definitions and computation of 'QA task success' and 'RAG faithfulness scores' and whether a train/test split was used for the SFT evaluation.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point-by-point below, agreeing with the concerns where valid and outlining specific revisions to strengthen the manuscript without overstating our claims.

Referee: [Abstract and Results] The headline DoRA SFT results (26% QA improvement, 47% hallucination reduction) are measured on the same 6.5K synthetic instances used for SFT. This does not demonstrate generalization to held-out queries or real user distributions and directly undermines the claim of improved RAG behavior under domain shift.

Authors: We acknowledge that the reported DoRA SFT results were computed on the full set of 6.5K synthetic instances used for fine-tuning, which limits direct evidence of generalization to held-out queries. To address this, we will revise the manuscript to include an explicit train/test split (e.g., 80/20) of the DoRA benchmark, with all headline metrics recomputed on the unseen test portion. The abstract and results sections will be updated accordingly, and claims about 'domain shift' will be qualified to refer specifically to performance gains on this synthetic benchmark for contamination-aware evaluation rather than broad generalization to operational user distributions. We note that real defense query logs remain inaccessible due to classification constraints.

revision: yes

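
A split like the one proposed in this response can leak if instances generated from the same seed document land on both sides. The sketch below shows one way to split at the document level instead; the `doc_id` field name and the 80/20 ratio applied to documents rather than instances are assumptions for illustration, not details from the paper.

```python
import random
from typing import Dict, List, Tuple


def split_by_document(
    instances: List[Dict],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Split instances so that no seed document appears in both train and test.

    Each instance is a dict with a hypothetical "doc_id" key naming the seed
    document it was generated from.
    """
    doc_ids = sorted({inst["doc_id"] for inst in instances})
    if len(doc_ids) < 2:
        raise ValueError("need at least two seed documents to split")
    rng = random.Random(seed)
    rng.shuffle(doc_ids)

    # Hold out a fraction of documents, keeping at least one on each side.
    n_test = min(len(doc_ids) - 1, max(1, round(len(doc_ids) * test_fraction)))
    test_docs = set(doc_ids[:n_test])

    train = [inst for inst in instances if inst["doc_id"] not in test_docs]
    test = [inst for inst in instances if inst["doc_id"] in test_docs]
    return train, test
```

Splitting by document means the instance-level ratio will only approximate 80/20 when documents yield different numbers of questions, which is the trade-off for keeping the test portion unseen at the source level.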
Referee: [Benchmark Construction] No quantitative checks (e.g., KL divergence to real query logs, expert fidelity ratings, or artifact detection) are described for whether the intent-conditioned synthetic QA pairs and curated evidence passages match the statistical properties of actual defense document queries, including question-type distribution and attribution difficulty.

Authors: We agree that additional validation metrics would improve the benchmark description. Due to the sensitive and classified nature of the source defense documents, real query logs are unavailable, precluding KL divergence or direct statistical matching to operational distributions. We will expand the benchmark construction section with: (1) explicit reporting of question-type balance across the five categories, (2) details on evidence passage curation for attribution, (3) basic statistical summaries (lengths, vocabulary overlap) and post-generation filtering steps to address artifact detection, and (4) a limitations paragraph noting the absence of expert fidelity ratings. These additions will be quantitative where possible within the constraints of the data.

revision: partial

standing simulated objections not resolved

Quantitative comparison (e.g., KL divergence) to real defense query logs, as such logs are inaccessible due to classification and security restrictions.
Expert fidelity ratings on the synthetic QA pairs, as this would require domain-expert access to classified materials not available during the original study.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model comparisons

The paper constructs a synthetic benchmark (DoRA) from defense documents and reports empirical performance of models including a fine-tuned variant (DoRA SFT) versus the base Llama3.1-8B-Instruct. No mathematical derivation chain, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations exist. Central claims are direct end-to-end QA and faithfulness metrics on the constructed instances, without any reduction of results to inputs by construction. This is a standard empirical benchmark paper whose claims remain independent of the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; full paper would likely add more assumptions about synthetic data quality and retriever choice.

axioms (1)

domain assumption: A fixed dense retriever is sufficient and representative for end-to-end RAG evaluation on defense documents.
Stated in the evaluation description.

invented entities (1)

DoRA benchmark (no independent evidence)
purpose: Domain-specific synthetic QA test set with attribution
Newly constructed collection of 6.5K instances

pith-pipeline@v0.9.0 · 5478 in / 1315 out tokens · 63866 ms · 2026-05-10T04:47:26.021886+00:00 · methodology
