arxiv: 2606.02643 · v1 · pith:VRCEDPHLnew · submitted 2026-05-31 · 💻 cs.CR · cs.AI · cs.DB

Inference Cost Attacks for Retrieval-Augmented Large Language Models

Pith reviewed 2026-06-28 16:53 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.DB
keywords retrieval-augmented generation · inference cost attacks · knowledge base poisoning · LLM security · token consumption · adversarial documents

The pith

Poisoning external knowledge bases can force retrieval-augmented LLMs to consume up to 13.12 times as many tokens per query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that retrieval-augmented generation systems face a practical threat from attacks that poison their external knowledge sources instead of directly editing prompts. Attackers can insert crafted documents that the retriever will select, causing the downstream language model to generate responses that require far more tokens. The authors implement this through an automated framework that uses LLM agents to produce documents that are both relevant enough to be retrieved and expensive enough to inflate cost. Experiments across three datasets show token consumption rising by a factor of up to 13.12, with success rates above 90 percent and no reported loss in answer integrity. This matters because many deployed RAG systems draw from open web sources that an adversary could realistically contaminate.

Core claim

The Retrieval-Augmented Inference Cost Attack succeeds by injecting malicious documents into external knowledge corpora; these documents are retrieved at inference time and trigger abnormally high token counts during generation, reaching a maximum increase of 13.12 times with success rates exceeding 90 percent and without any loss in answer integrity. The attack is realized through the CREEP framework, which deploys LLM agents fine-tuned by Memory-Augmented Group Relative Policy Optimization to generate documents that are semantically aligned for retrieval yet computationally burdensome.

What carries the argument

CREEP (Computational Resource Exhaustion via External Poisoning), a framework that uses LLM agents and memory-augmented reinforcement learning to automatically generate malicious documents that remain retrievable while forcing higher token consumption.

If this is right

  • RAG systems that rely on untrusted external data become exposed to cost-based resource exhaustion.
  • Attack effectiveness holds across multiple real-world datasets while preserving output correctness.
  • Existing RAG pipelines require additional checks on retrieved content beyond semantic relevance.
  • The same poisoning approach could be adapted to target other retrieval-dependent generation pipelines.
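
The third point above can be made concrete with a small sketch of one hypothetical check that looks at generation cost instead of semantic relevance. It is an illustration only: the paper as summarized here does not propose or evaluate this defense, and the class name `TokenBudgetMonitor`, the window size, and the `max_ratio` threshold are all assumptions chosen for the example.

```python
from collections import deque
from statistics import median


class TokenBudgetMonitor:
    """Flags responses whose token count far exceeds the recent median.

    Hypothetical sketch: the thresholds below are illustrative, not tuned.
    """

    def __init__(self, window=100, max_ratio=5.0, min_history=3):
        self.history = deque(maxlen=window)  # recent unflagged token counts
        self.max_ratio = max_ratio           # allowed multiple of the median
        self.min_history = min_history       # samples needed before flagging

    def check(self, output_tokens):
        """Return True if the response is anomalously long.

        Unflagged counts are added to the baseline; flagged ones are not,
        so a burst of inflated responses cannot raise the median.
        """
        flagged = (
            len(self.history) >= self.min_history
            and output_tokens > self.max_ratio * median(self.history)
        )
        if not flagged:
            self.history.append(output_tokens)
        return flagged
```

A deployment would call `check` once per response and decide separately what a flag means (logging, truncation, or dropping the retrieved documents). A slow, gradual inflation would likely evade a median-based rule like this one, which is one reason it should be read as a starting point only.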

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of this attack would raise the operating cost of public RAG services that ingest open web content.
  • Document filtering or provenance verification mechanisms would directly test the practical reach of the attack.
  • Similar cost-inflation tactics might apply to retrieval components in non-LLM systems such as search engines or recommendation engines.

Load-bearing premise

Malicious documents can be inserted into external knowledge bases and will be retrieved by the RAG pipeline without detection or filtering.

What would settle it

Run identical queries against a RAG system before and after the knowledge base is seeded with the generated malicious documents and measure whether token consumption increases by the reported factor.
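
A minimal sketch of that before-and-after measurement, assuming per-query token counts have already been collected from both runs. The `QueryResult` type, the 2x success threshold, and the sample numbers are made up for illustration; the paper's own definition of attack success is not given in the text reproduced here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class QueryResult:
    clean_tokens: int     # tokens generated with the clean knowledge base
    poisoned_tokens: int  # tokens generated after seeding malicious documents


def inflation_factor(results):
    """Ratio of mean poisoned token count to mean clean token count."""
    clean = mean(r.clean_tokens for r in results)
    poisoned = mean(r.poisoned_tokens for r in results)
    return poisoned / clean


def success_rate(results, threshold=2.0):
    """Fraction of queries whose token count grew at least `threshold`-fold.

    The threshold is an assumption; substitute the paper's criterion if known.
    """
    hits = sum(1 for r in results if r.poisoned_tokens >= threshold * r.clean_tokens)
    return hits / len(results)


# Made-up token counts, for illustration only (not the paper's data).
results = [
    QueryResult(100, 900),
    QueryResult(200, 2600),
    QueryResult(100, 150),
    QueryResult(400, 4800),
]
print(inflation_factor(results), success_rate(results))
```

The ratio of means is used here so that a few very short clean answers do not dominate the result; a mean of per-query ratios is an equally defensible choice and can give a noticeably different number, so a replication should state which one it reports.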

Figures

Figures reproduced from arXiv: 2606.02643 by Chengliang Liu, Liangbo Ning, Wenqi Fan, Yujuan Ding.

Figure 1: Comparison of (a) existing LLM inference cost at [...]
Figure 2: An overview of our proposed Computational Resource Exhaustion via External Poisoning (CREEP) framework.
Figure 3: Sensitivity analysis of the hyperparameter [...]

Original abstract

Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an extra multi-stage pipeline that dynamically retrieves and synthesizes information from external knowledge sources. This high operational cost exposes a critical vulnerability to Inference Cost Attacks (ICAs). However, existing ICAs often rely on the impractical assumption of direct prompt manipulation. We argue that a more feasible and potent threat to RAG-enhanced LLM systems arises from poisoning external knowledge bases (e.g., web knowledge from the Internet). In this work, we introduce the Retrieval-Augmented Inference Cost Attack (RA-ICA), a novel attacking paradigm that targets the computational cost of RAG-enhanced LLM systems by injecting malicious documents into external knowledge corpus. To operationalize this attack, we propose Computational Resource Exhaustion via External Poisoning (CREEP), a novel framework that leverages LLM agents to automatically craft malicious documents that are both semantically relevant for retrieval and potent for inducing an abnormal increase in token consumption during the inference phase. To enhance the attack's effectiveness, we introduce Memory-Augmented Group Relative Policy Optimization (MA-GRPO), a novel reinforcement learning algorithm that fine-tunes the agents by learning from a dynamic memory of historical best adversarial documents. Extensive experiments across three real-world datasets demonstrate that RA-ICA increases token consumption by up to 13.12 times with an over 90% success rate, without degrading the integrity of the generated answer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that Retrieval-Augmented Inference Cost Attacks (RA-ICA) can be mounted against RAG-enhanced LLMs by injecting malicious documents into external knowledge bases. These documents are automatically generated by the CREEP framework, which uses LLM agents fine-tuned via the proposed Memory-Augmented Group Relative Policy Optimization (MA-GRPO) algorithm. Experiments on three real-world datasets reportedly show up to 13.12× increase in token consumption with >90% success rate while preserving answer integrity.

Significance. If the empirical results hold under realistic conditions, the work would be significant for highlighting a cost-based attack vector on RAG systems that does not require direct prompt access. The introduction of MA-GRPO as a memory-augmented RL method for crafting retrieval-potent adversarial documents is a technical contribution worth noting. The quantitative claims across multiple datasets provide concrete numbers that could motivate defenses, though the practical impact depends on the untested injection assumption.

major comments (2)
  1. [Threat Model and Experiments] The central claim of practical viability (13.12× token increase and >90% success) rests on the assumption that CREEP-generated documents can be injected into external/open-web KBs and retrieved without detection. The experiments use simulated/controlled datasets; no evaluation is provided of evasion against content filters, edit rate limits, or trusted-source prioritization. This is load-bearing for the threat model and should be addressed with either additional experiments or explicit scope limitations.
  2. [Abstract and § Experiments] Abstract and results reporting: quantitative claims (13.12× tokens, >90% success) are stated without accompanying details on baselines, statistical tests, variance across runs, or ablation of MA-GRPO components. The full experimental section must include these to allow assessment of whether the gains are attributable to the proposed method.
minor comments (1)
  1. [Introduction] Acronyms RA-ICA, CREEP, and MA-GRPO should be expanded on first use in the main text for readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Threat Model and Experiments] The central claim of practical viability (13.12× token increase and >90% success) rests on the assumption that CREEP-generated documents can be injected into external/open-web KBs and retrieved without detection. The experiments use simulated/controlled datasets; no evaluation is provided of evasion against content filters, edit rate limits, or trusted-source prioritization. This is load-bearing for the threat model and should be addressed with either additional experiments or explicit scope limitations.

    Authors: We agree that the practical threat model depends on successful injection and retrieval. Our experiments deliberately focus on the core attack effectiveness (token inflation and success rate) under the assumption that the malicious documents are retrieved, which is consistent with the stated threat model of poisoning external KBs. We do not claim robustness against all possible defenses. In the revision we will add an explicit Limitations subsection that states the attack is conditional on retrieval success and does not evaluate evasion against content filters or trusted-source mechanisms. This directly implements the suggested scope limitation without requiring new experiments outside the paper's current scope.

    Revision: yes

  2. Referee: [Abstract and § Experiments] Abstract and results reporting: quantitative claims (13.12× tokens, >90% success) are stated without accompanying details on baselines, statistical tests, variance across runs, or ablation of MA-GRPO components. The full experimental section must include these to allow assessment of whether the gains are attributable to the proposed method.

    Authors: We acknowledge that the current experimental reporting is insufficient for full reproducibility and attribution. The revised manuscript will expand the Experiments section to include: (1) explicit baseline comparisons (random document injection, non-RL poisoning, and simpler heuristic methods), (2) statistical significance testing (paired t-tests or Wilcoxon signed-rank tests with p-values), (3) variance reported as mean ± standard deviation across at least five independent runs, and (4) ablation studies isolating the contribution of the memory-augmented component and the group-relative optimization in MA-GRPO. These additions will be placed in the main experimental results and an appendix for completeness.

    Revision: yes
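
The reporting promised in items (2) and (3) of this response can be sketched with the Python standard library alone. The numbers are placeholders, not results from the paper, and the function names are hypothetical. The sketch stops at the paired t statistic; turning it into a p-value needs a t distribution with n - 1 degrees of freedom from a statistics package or table.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs):
    """Mean and sample standard deviation across independent runs."""
    return mean(runs), stdev(runs)


def paired_t_statistic(method, baseline):
    """t statistic for paired samples (degrees of freedom = n - 1).

    Assumes the two lists are aligned run-by-run, e.g. same seed or data split.
    """
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))


# Placeholder inflation factors from five hypothetical runs.
method_runs = [12.0, 13.0, 11.0, 14.0, 12.5]
baseline_runs = [2.0, 2.5, 1.5, 3.0, 2.0]

m, s = summarize(method_runs)
print(f"method: {m:.2f} ± {s:.2f}")
print(f"paired t = {paired_t_statistic(method_runs, baseline_runs):.2f}")
```

With only five runs the normality assumption behind the t-test is hard to check, which is presumably why the response also names the Wilcoxon signed-rank test as an alternative.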

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation with no derivations or self-referential predictions.

Full rationale

The paper introduces RA-ICA, CREEP, and MA-GRPO as new constructs and evaluates them via experiments on three datasets, reporting measured token increases and success rates. No equations, fitted parameters, or predictions are defined in terms of themselves; results are direct experimental observations rather than reductions of inputs. No load-bearing self-citations or uniqueness theorems appear in the provided text. The work is self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

Paper introduces three new named constructs (RA-ICA, CREEP, MA-GRPO) and relies on unverified assumptions about poisoning feasibility and LLM agent capability. No independent evidence or formal verification is provided for these entities.

free parameters (1)
  • MA-GRPO training hyperparameters
    Parameters controlling the reinforcement learning process for agent fine-tuning; values not specified in abstract.
axioms (2)
  • [domain assumption] External knowledge bases can be poisoned via document injection without immediate detection
    Required for the attack to reach the retrieval stage in real deployments.
  • [domain assumption] LLM agents can reliably generate documents that are both retrievable and token-intensive
    Core premise of the CREEP framework.
invented entities (3)
  • RA-ICA (no independent evidence)
    Purpose: New attacking paradigm targeting RAG inference cost
    Introduced as the overall attack method.
  • CREEP (no independent evidence)
    Purpose: Framework using LLM agents to craft malicious documents
    Operationalizes the poisoning attack.
  • MA-GRPO (no independent evidence)
    Purpose: Memory-augmented reinforcement learning algorithm for agent training
    Novel RL component claimed to improve attack potency.

pith-pipeline@v0.9.1-grok · 5793 in / 1635 out tokens · 29842 ms · 2026-06-28T16:53:49.246074+00:00 · methodology

