SentinelRAG: Synthetic Sentinel Knowledge for RAG Database Copyright Protection

Chang Liu; Ki Sen Hung; Tsun On Kwok; Xi Yang; Yangqiu Song

arxiv: 2606.05787 · v1 · pith:LIOUHKZ4new · submitted 2026-06-04 · 💻 cs.CR

Pith reviewed 2026-06-28 00:41 UTC · model grok-4.3

classification 💻 cs.CR

keywords RAG database · watermarking · copyright protection · synthetic knowledge · retrieval augmented generation · false positive reduction · ownership detection · knowledge base security

The pith

SentinelRAG protects RAG databases by injecting fictitious knowledge entries that trigger only on owner-known probes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SentinelRAG to address unauthorized redistribution of proprietary RAG databases. It embeds style-consistent entries about fictitious entities that normal queries rarely retrieve but that respond to secret probes held only by the owner. The authors report statistically significant detection at an injection rate of 0.1 percent across datasets from thousands to millions of documents, with fewer false detections than earlier methods and little effect on legitimate queries. A sympathetic reader would care because RAG systems rely on large proprietary knowledge bases whose copying is otherwise hard to prove without introducing misinformation or fragile markers.

Core claim

By embedding style-consistent but fictitious knowledge entries into the RAG database, SentinelRAG enables the data owner to detect unauthorized redistribution through targeted probes while keeping the false detection rate low and interference with legitimate queries negligible. Experiments on datasets from 2.9k to 8.8M documents show statistically significant detection with $p < 10^{-5}$ at a 0.1 percent injection rate, outperforming prior methods in false positive reduction.

What carries the argument

The injection of synthetic sentinel knowledge describing fictitious entities, which serves as a triggerable marker for ownership verification without affecting normal retrieval.

If this is right

Owners can verify if their database has been copied by checking if secret probes retrieve the sentinel entries.
The approach works at injection rates as low as 0.1 percent while achieving high statistical significance.
False detection rates are lower than existing watermarking techniques.
Legitimate user queries experience negligible interference.
Results hold across multiple dataset sizes from thousands to millions of documents.
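
To make the first two consequences concrete, here is a minimal sketch of how an owner might turn probe outcomes into a p-value. It is an illustration under stated assumptions, not the paper's procedure: the page does not say which statistical test the authors use, so a one-sided binomial test stands in for it, and `sentinel_budget`, `binomial_p_value`, the probe counts, and the 5% null hit rate are all hypothetical.

```python
from math import comb


def sentinel_budget(corpus_size: int, injection_rate: float = 0.001) -> int:
    """Number of sentinel documents implied by an injection rate (at least 1)."""
    return max(1, round(corpus_size * injection_rate))


def binomial_p_value(hits: int, probes: int, null_rate: float) -> float:
    """One-sided P(X >= hits) for X ~ Binomial(probes, null_rate).

    null_rate is the assumed chance that a system WITHOUT the sentinels
    answers a probe consistently with the sentinel facts.
    """
    return sum(
        comb(probes, i) * null_rate**i * (1 - null_rate) ** (probes - i)
        for i in range(hits, probes + 1)
    )


# Corpus sizes taken from the reported range (2.9k and 8.8M documents).
small_budget = sentinel_budget(2_900)
large_budget = sentinel_budget(8_800_000)

# Hypothetical outcome: 12 of 20 probes answered with sentinel facts,
# against an assumed 5% accidental-match rate for a clean system.
p_value = binomial_p_value(hits=12, probes=20, null_rate=0.05)
print(small_budget, large_budget, p_value < 1e-5)
```

Under these assumed numbers, a 0.1 percent rate means only a handful of sentinels in the smallest corpus, which is why the number of probes and the null hit rate matter as much as the injection rate itself.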

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the fictitious entries blend too well, some users might still query them by chance in specialized domains.
This method could apply to protecting other types of proprietary knowledge bases used in search or recommendation systems.
Future work might test the method against paraphrasing attacks that target the style of the entries.
Combining this with other protection layers could strengthen overall security for RAG systems.

Load-bearing premise

That knowledge about fictitious entities will almost never be retrieved by ordinary user queries but will always respond to the owner's secret probes.

What would settle it

A test showing that a significant number of legitimate queries retrieve the injected fictitious entries, or that the targeted probes no longer reliably detect the presence of the watermarks after redistribution.
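
The first half of that test can be sketched as a simple audit over retrieval logs. This is an assumption-laden illustration: the log format, the document IDs, and the function names are hypothetical, and the zero-hit bound is the standard exact one-sided binomial bound rather than anything taken from the paper.

```python
def false_trigger_rate(retrieved_ids_per_query, sentinel_ids, k=5):
    """Fraction of legitimate queries whose top-k results contain a sentinel."""
    if not retrieved_ids_per_query:
        raise ValueError("need at least one query")
    sentinels = set(sentinel_ids)
    hits = sum(
        1 for ranked in retrieved_ids_per_query
        if sentinels.intersection(ranked[:k])
    )
    return hits / len(retrieved_ids_per_query)


def zero_hit_upper_bound(n_queries: int, alpha: float = 0.05) -> float:
    """Exact one-sided upper bound on the true rate when 0 of n queries hit."""
    return 1 - alpha ** (1 / n_queries)


# Hypothetical ranked results for four legitimate queries.
logs = [
    ["d1", "d7", "s3"],
    ["d2", "d4", "d9"],
    ["d5", "d1", "d8"],
    ["d3", "d6", "d2"],
]
sentinel_ids = {"s1", "s2", "s3"}

rate_top3 = false_trigger_rate(logs, sentinel_ids, k=3)
rate_top2 = false_trigger_rate(logs, sentinel_ids, k=2)
bound_1000 = zero_hit_upper_bound(1000)
print(rate_top3, rate_top2, bound_1000)
```

The bound is the useful part: observing zero sentinel retrievals in a finite query sample only limits the true rate, it does not show the rate is zero.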

Figures

Figures reproduced from arXiv: 2606.05787 by Chang Liu, Ki Sen Hung, Tsun On Kwok, Xi Yang, Yangqiu Song.

**Figure 1.** High-level overview of SENTINELRAG.

**Figure 2.** Architectural overview of SENTINELRAG. Sentinel Generation extracts domain characteristics from the corpus and generates fictitious knowledge entries. Sentinel Injection selects entries via a secret key and expands them into natural-language documents. Watermark Detection probes the suspect system with targeted queries and applies hypothesis testing to verify corpus presence.

**Figure 3.** Detectability vs. Interference trade-off analysis across four datasets. The x-axis measures detectability …

**Figure 5.** Retrieval and answer interference of SENTINELRAG and RAG-WM across corpus subsets of increasing size under fixed sentinel injection.

Adjacent table fragment (rows: budget B; columns: model, Ours vs. RAG-WM; truncated in the source):

| Budget B | GPT-5-mini Ours | GPT-5-mini RAG-WM | Qwen-3-8B Ours | Qwen-3-8B RAG-WM | Gemini-3-Flash Ours | Gemini-3-Flash RAG-WM | GPT-OSS-20B Ours | GPT-OSS-20B RAG-WM |
|---|---|---|---|---|---|---|---|---|
| 2 | 86.2 ± 4.2 | 0.0 ± 0.0 | 79.6 ± 4.7 | 0.0 ± 0.0 | 84.9 ± 2.2 | 0.0 ± 0.0 | 82.8 ± 5.6 | 0.0 ± 0.0 |
| 4 | 99.6 ± 0.3 | 86.4 ± 0.7 | 99.6 ± 0.2 | 84.8 ± 1.3 | 99.2 ± 0.4 | 87.3 ± 1.1 | 99.5 ± 0.3 | 85.9 … |

**Figure 6.** Safety profile by implausibility rate (IR) and …

**Figure 7.** Sensitivity to Retrieval Depth. Impact of …

**Figure 8.** Example Sentinel Instance for HotpotQA

MS-MARCO Sentinel Instance

Sentinel:
• (Neural-Lattice Lab, investigates, spin wave superposition)
• (spin wave superposition, utilized in, magnetic quasi crystal)
• (Neural-Lattice Lab, publishes, protocol paper QS-01)

Sentinel Document:
1 Neural-Lattice Lab investigates spin wave superposition.
2 spin wave superposition is utilized in magnetic quasi crystal.
3 Neura…

**Figure 9.** Example Sentinel Instance for MS-MARCO

**Figure 10.** Example Sentinel Instance for NFCorpus

**Figure 11.** RAG-WM watermark instance for MS-MARCO illustrating a harmful, fabricated relation between two …

**Figure 12.** RAG-WM watermark instance containing a clear semantic contradiction due to the random pairing of …

**Figure 13.** RAG-WM watermark instance illustrating an ontological inversion on two real entities, asserting that a …

**Figure 14.** The prompt used to abstract text documents into structured knowledge objects (JSON format).

**Figure 15.** The prompt designed to generate fictional knowledge objects (KO) based on domain analysis.

**Figure 16.** The prompt designed to expand a knowledge object (KO) into watermark text using few-shot style …

**Figure 17.** Prompt design for generating simple, fact-based verification questions. The prompt enforces strict constraints on keyword usage and question simplicity to …

**Figure 18.** The prompt designed for generating precise answers based on retrieved context (RAG).

**Figure 19.** LLM judge prompt used to verify whether the response is consistent with the provided sentinel factual …

**Figure 20.** The prompt used to strictly evaluate semantic equivalence between two answers.

**Figure 21.** LLM judge prompt for watermark payload safety evaluation. The judge assesses each payload along two …

**Figure 22.** System prompt used for complete text rewriting. The prompt enforces strict constraints to ensure factual …

**Figure 23.** Defensive RAG prompt with an additional requirement to avoid answering questions involving highly ambiguous or insufficiently grounded entities.

**Figure 24.** Highly restrictive RAG prompt that instructs the assistant to reject retrieved context suspected to contain …
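
The Figure 2 caption says Sentinel Injection "selects entries via a secret key" but this page does not describe how. One plausible reading, offered purely as a sketch, is a keyed pseudorandom ranking: tag each candidate entry with an HMAC under the owner's key and keep the entries with the smallest tags. The function name `select_sentinels`, the candidate IDs, and the key below are hypothetical, and the paper's actual selection rule may differ.

```python
import hashlib
import hmac


def select_sentinels(candidate_ids, secret_key: bytes, m: int):
    """Return the m candidates with the smallest HMAC-SHA256 tags.

    The choice is deterministic for a given key and does not depend on the
    order in which candidates are listed, so the owner can re-derive it later.
    """
    def tag(candidate_id: str) -> bytes:
        return hmac.new(secret_key, candidate_id.encode("utf-8"), hashlib.sha256).digest()

    return sorted(candidate_ids, key=tag)[:m]


# Hypothetical pool of generated knowledge-object IDs and a placeholder key.
candidates = [f"ko-{i:03d}" for i in range(50)]
picked = select_sentinels(candidates, b"owner-secret-key", m=5)
print(picked)
```

A keyed selection of this kind means someone who sees the candidate pool still cannot tell which entries were injected without the key, which is the property the caption appears to rely on.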

Original abstract

Protecting proprietary RAG databases from unauthorized redistribution is challenging: existing watermarking methods either inject fabricated relations between real entities, polluting the knowledge base with misinformation, or embed fragile lexical patterns that adversarial paraphrasing easily removes. We propose SentinelRAG, a watermarking framework that embeds style-consistent but fictitious knowledge entries into the RAG database. Our key insight is that synthetic knowledge describing fictitious entities is unlikely to be retrieved by legitimate queries, yet can be reliably triggered through targeted probes known only to the data owner. Experiments on four datasets ranging from 2.9k to 8.8M documents demonstrate that SentinelRAG achieves statistically significant detection $p < 10^{-5}$ across all tested configurations at only a 0.1% injection rate. Compared to the state-of-the-art, our method significantly reduces the false detection rate while maintaining negligible interference with legitimate user queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SentinelRAG introduces fictitious but style-matched entries for RAG watermarking and reports strong detection at low injection rates, yet the separation from real queries lacks quantitative checks.

The paper's main contribution is a watermarking method that adds a tiny fraction of made-up knowledge entries about nonexistent entities, written to match the style of the real database. This sidesteps the misinformation problem of fabricating links between actual entities and the fragility of lexical watermarks that paraphrasing can erase.

It does a clean job on the reported experiments: statistically significant detection below p=10^{-5} at 0.1% injection across four datasets from thousands to millions of documents, with lower false positives than the baselines and almost no measurable effect on ordinary queries.

The soft spot is the load-bearing assumption that these synthetic entries will never surface under legitimate query distributions. The abstract and results treat this as given because the entities are fictitious, but there is no reported measurement of embedding distances to real entities in the same domain or stress tests with semantically adjacent queries. Without that, the claimed negligible interference and reduced false detection rate rest on an untested separation.

The work is aimed at people building or protecting retrieval-augmented systems who need a practical ownership signal. A reader focused on data security in AI pipelines would find the idea worth examining.

It deserves peer review. The method is distinct enough and the detection numbers are sharp enough that referees should see the full experimental details and any additional controls on query overlap.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SentinelRAG, a watermarking framework for RAG databases that injects style-consistent fictitious knowledge entries describing fictitious entities at a 0.1% rate. The central claim is that these entries are unlikely to be retrieved by legitimate queries yet reliably triggered by owner-known probes, yielding statistically significant detection (p < 10^{-5}) across four datasets (2.9k to 8.8M documents), reduced false detection rates, and negligible interference relative to prior methods.

Significance. If the separation between fictitious and real entities holds under realistic query distributions, the method could provide a practical, low-pollution approach to RAG copyright protection. The low injection rate and reported statistical significance are strengths; however, the result depends on an unverified embedding-space separation that is load-bearing for the false-positive and interference claims.

major comments (2)

[Abstract] The claims of p < 10^{-5} detection and 'negligible interference' with 'reduced false detection rate' rest on the assumption that style-consistent fictitious entries lie outside the retrieval radius of all legitimate queries; no quantitative bound (e.g., embedding distances or coverage of semantically adjacent real queries) is supplied to support this separation.
[Abstract] Abstract and experimental description: the reported results lack details on experimental setup, controls, query sets used to test interference, or full data, preventing verification that the statistically significant outcomes are not artifacts of the chosen test distributions.

minor comments (1)

The abstract states results on four datasets but does not name their domains or sizes beyond the range 2.9k–8.8M.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments correctly identify areas where the current manuscript version would benefit from additional quantitative support and experimental transparency. We will revise the paper to address both points directly.

Referee: [Abstract] The claims of p < 10^{-5} detection and 'negligible interference' with 'reduced false detection rate' rest on the assumption that style-consistent fictitious entries lie outside the retrieval radius of all legitimate queries; no quantitative bound (e.g., embedding distances or coverage of semantically adjacent real queries) is supplied to support this separation.

Authors: We agree that the abstract (and the current version) does not supply explicit quantitative bounds on embedding-space separation. In the revision we will add a dedicated analysis section reporting (i) cosine-distance distributions between sentinel entries and the nearest real documents, (ii) retrieval rates for a constructed set of semantically adjacent legitimate queries, and (iii) the resulting empirical coverage of the retrieval radius. These additions will directly support the false-positive and interference claims.

revision: yes

Referee: [Abstract] Abstract and experimental description: the reported results lack details on experimental setup, controls, query sets used to test interference, or full data, preventing verification that the statistically significant outcomes are not artifacts of the chosen test distributions.

Authors: The full manuscript already describes the four datasets and the 0.1% injection protocol, but we acknowledge that query-set construction, interference-test controls, and statistical procedures are not presented at the level of detail needed for independent verification. In the revision we will expand the experimental section with explicit descriptions of the legitimate query corpora, the probe sets, the exact statistical test used for p < 10^{-5}, and any additional controls. We will also release the query-generation code and the precise sentinel-entry templates to allow reproduction.

revision: yes
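
Item (i) in the first response, cosine-distance distributions between sentinel entries and their nearest real documents, could be computed along the following lines. This is a minimal sketch with toy three-dimensional vectors standing in for real embeddings; the helper names and the data are hypothetical, and a real analysis would use the same embedding model as the deployed retriever.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def nearest_real_distance(sentinel_vecs, real_vecs):
    """For each sentinel, the cosine distance (1 - similarity) to its closest real document."""
    return [
        1 - max(cosine_similarity(s, r) for r in real_vecs)
        for s in sentinel_vecs
    ]


# Toy embeddings: the second sentinel sits much closer to the real documents.
real_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sentinels = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

distances = nearest_real_distance(sentinels, real_docs)
print(distances)
```

Sentinels with small nearest-neighbor distances are the ones most likely to be pulled in by semantically adjacent legitimate queries, so the lower tail of this distribution is what a reviewer would want to see reported.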

Circularity Check

0 steps flagged

No circularity; empirical claims rest on measured retrieval rates

The paper presents SentinelRAG as an empirical watermarking method whose detection performance (p < 10^{-5} at 0.1% injection) is reported from direct experiments on four datasets. The central premise—that fictitious entries are unlikely to be retrieved by legitimate queries—is stated as an assumption and then tested via measured interference and false-positive rates rather than derived from prior results or self-referential definitions. No equations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Core mechanism relies on the introduction of fictitious knowledge entries as watermarks; no free parameters or axioms explicitly stated in abstract.

invented entities (1)

style-consistent fictitious knowledge entries · no independent evidence
purpose: Serve as detectable watermarks unlikely to be retrieved by legitimate queries
Invented as the central technical contribution; no independent evidence outside the method itself.

pith-pipeline@v0.9.1-grok · 5715 in / 1014 out tokens · 61767 ms · 2026-06-28T00:41:51.089292+00:00 · methodology
