pith. sign in

arxiv: 2606.03090 · v2 · pith:LSWJCVZFnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

Pith reviewed 2026-06-28 09:59 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt injectionLLM securityautomatic gradingeducational AIvulnerabilityrubric-based assessment
0
0 comments X

The pith

LLM-based automatic grading systems remain highly vulnerable to prompt injection attacks that inflate scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how attackers can use prompt injection to manipulate LLM-based automatic grading systems into assigning high scores to low-quality student answers. Experiments test these attacks in realistic rubric-based grading setups and evaluate common defensive strategies. Results indicate that existing systems can be reliably tricked regardless of actual answer content. This exposes risks to fairness and trust in AI-supported education. The work calls for greater attention to security in such tools.

Core claim

Through comprehensive experiments under rubric-based grading settings, current LLM-based AG systems remain highly vulnerable to PI attacks, allowing manipulation into artificially high scores irrespective of answer quality.

What carries the argument

Prompt injection attacks that insert overriding instructions into student submissions to bypass the provided grading rubric.

If this is right

  • Attackers can obtain undeserved high grades by embedding injection prompts in submissions.
  • Existing defensive strategies fail to prevent score manipulation in rubric-based grading.
  • The reliability of LLM-based assessment for educational fairness and integrity is undermined.
  • Secure design practices are required before wider deployment of such grading systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vulnerabilities may appear in other LLM decision systems that process untrusted user text against fixed rules.
  • Hybrid human review layers could serve as a short-term mitigation while stronger technical defenses are developed.
  • Testing protocols for new AG systems should include systematic prompt injection evaluations as standard practice.

Load-bearing premise

The prompt injection attacks and defensive strategies tested under rubric-based settings represent real-world educational use cases and attacker capabilities.

What would settle it

A controlled test in which an LLM grader assigns scores consistent with the rubric even after multiple prompt injection attempts across different models, subjects, and attack variants.

Figures

Figures reproduced from arXiv: 2606.03090 by Fedor Filippov, Hang Li, Hui Liu, Jiliang Tang, Kaiqi Yang, Pengfei He, Yingqian Cui, Yucheng Chu, Yuping Lin.

Figure 1
Figure 1. Figure 1: A PI attack on an LLM-based AG system. By [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Factors considered in designing attack strate [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Defense strategies based on preventive instruc [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average performance of preventative instruction defenses across attacking methods and grader LLMs. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of attention distributions across regular, attacked, and defended responses using Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of attention distributions across different preventative instructions using Qwen3-4B. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Grading prompt template. it is forbidden from answering the questions di￾rectly and may only manipulate the grading pro￾cess (e.g. attacking the rubric, the reference answer, the question, or the scenario). The attacker also re￾ceives the scores from its previous attempts, letting it iteratively refine the suffix based on feedback. I Preventative Instruction Prompt Tab. 8 summarizes the three preventative … view at source ↗
Figure 8
Figure 8. Figure 8: Attacker model prompt used in the Agentic [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
read the original abstract

The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educators can deploy AG systems across diverse tasks using only natural language rubrics while achieving satisfactory grading performance. Despite these advantages, new security concerns may also arise. In particular, prompt injection (PI) attacks have recently become a major threat to LLM-based applications. In the context of AG, attackers can potentially exploit PI vulnerabilities to manipulate grading systems into assigning artificially high scores regardless of the actual answer quality. Such behavior poses serious risks to the fairness, reliability, and integrity of educational assessment. In this work, we study PI attacks in AG systems, and systematically investigate the effectiveness of such attacks in educational scenarios. We further evaluate the effectiveness of existing defensive strategies against these attacks. Through comprehensive experiments under rubric-based grading settings, we demonstrate that current LLM-based AG systems remain highly vulnerable to PI attacks. We hope that our findings raise awareness of this emerging threat and motivate future research toward secure, robust, and trustworthy LLM-based educational systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines prompt injection (PI) attacks against LLM-based automatic grading (AG) systems. It argues that attackers can use PI to force artificially high scores irrespective of answer quality, studies the effectiveness of such attacks in educational scenarios, evaluates existing defenses, and concludes via comprehensive experiments under rubric-based grading settings that current LLM-based AG systems remain highly vulnerable to PI.

Significance. If the empirical findings hold and generalize, the work identifies a concrete and practically relevant security threat to an emerging class of educational tools. By focusing on rubric-based AG and measuring attack success against defensive baselines, it supplies falsifiable evidence that can motivate follow-on research on robust prompt engineering, access controls, and verification layers for LLM graders. The absence of mathematical derivations or parameter-free claims is appropriate for an empirical security study.

major comments (2)
  1. [Abstract / Experimental Setup] Abstract and § on experimental design: the central claim that 'current LLM-based AG systems remain highly vulnerable' is supported only by experiments conducted exclusively under rubric-based grading settings. No evidence is supplied that the tested rubric formats, LLM backends, attack vectors, or defensive baselines adequately sample deployed AG systems (e.g., those employing chain-of-thought grading, multi-turn interaction, or institution-specific templates). This representativeness gap is load-bearing for the broad conclusion.
  2. [Evaluation] Evaluation section: the paper reports attack success rates and defense effectiveness, yet provides no quantitative comparison against non-rubric baselines or real-world deployment logs. Without such controls, it is unclear whether the measured vulnerability is an artifact of the chosen rubric-based prompt structure rather than an inherent property of LLM-based AG.
minor comments (2)
  1. [Title] Title formatting: the leading quotation marks and bolded example prompt are unconventional and may confuse readers; consider moving the example to the introduction or abstract.
  2. [Introduction] Notation: the manuscript uses 'PI' and 'AG' without an initial definition paragraph; add a short glossary or first-use expansion for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major comments point-by-point below, proposing targeted revisions to clarify scope and limitations while preserving the paper's empirical focus on rubric-based settings.

read point-by-point responses
  1. Referee: [Abstract / Experimental Setup] Abstract and § on experimental design: the central claim that 'current LLM-based AG systems remain highly vulnerable' is supported only by experiments conducted exclusively under rubric-based grading settings. No evidence is supplied that the tested rubric formats, LLM backends, attack vectors, or defensive baselines adequately sample deployed AG systems (e.g., those employing chain-of-thought grading, multi-turn interaction, or institution-specific templates). This representativeness gap is load-bearing for the broad conclusion.

    Authors: We agree that all experiments are confined to rubric-based grading, as stated throughout the manuscript including the abstract. The central claim is scoped to this setting, which is a standard and widely deployed approach for LLM-based AG. To address the concern, we will revise the abstract, introduction, and conclusion to explicitly qualify the findings as applying to rubric-based systems and add a dedicated limitations paragraph acknowledging that other paradigms (chain-of-thought, multi-turn, institution-specific) are not tested and may differ. No new experiments are feasible within the current study. revision: yes

  2. Referee: [Evaluation] Evaluation section: the paper reports attack success rates and defense effectiveness, yet provides no quantitative comparison against non-rubric baselines or real-world deployment logs. Without such controls, it is unclear whether the measured vulnerability is an artifact of the chosen rubric-based prompt structure rather than an inherent property of LLM-based AG.

    Authors: The paper's scope is deliberately limited to rubric-based grading as the primary practical deployment model for natural-language rubrics. Non-rubric baselines are outside this scope and were not evaluated. Real-world deployment logs are unavailable for this controlled study; we will add an explicit limitations discussion noting this gap and recommending future work on diverse grading methods and field data. This does not alter the reported results for the rubric-based case. revision: partial

Circularity Check

0 steps flagged

Empirical security study exhibits no circularity

full rationale

The paper conducts an experimental evaluation of prompt injection attacks on LLM-based automatic grading systems under rubric-based settings. It contains no mathematical derivations, parameter fittings, uniqueness theorems, or self-citations that serve as load-bearing premises. The central claim of vulnerability is grounded directly in the reported experiments rather than reducing to any input by construction or self-reference. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical security analysis relying on standard assumptions about LLM behavior, attack feasibility, and the representativeness of rubric-based grading tasks; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.1-grok · 5762 in / 897 out tokens · 25386 ms · 2026-06-28T09:59:03.498058+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

    GradingAttack: Attacking Large Language Models Towards Short Answer Grading Ability , author=. arXiv preprint arXiv:2602.00979 , year=

  2. [2]

    A llm-powered automatic grading framework with human-level guidelines optimization,

    A llm-powered automatic grading framework with human-level guidelines optimization , author=. arXiv preprint arXiv:2410.02165 , year=

  3. [3]

    IEEE Signal Processing Magazine , volume=

    Large language models for education: A survey and outlook , author=. IEEE Signal Processing Magazine , volume=. 2026 , publisher=

  4. [4]

    Do Anything Now

    "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , author=. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , pages=

  5. [5]

    2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=

    Jailbreaking black box large language models in twenty queries , author=. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=. 2025 , organization=

  6. [6]

    Computers, Materials, & Continua , volume=

    Prompt injection attacks on large language models: A survey of attack methods, root causes, and defense strategies , author=. Computers, Materials, & Continua , volume=. 2026 , publisher=

  7. [7]

    arXiv preprint arXiv:2402.14601 , year=

    Bringing generative AI to adaptive learning in education , author=. arXiv preprint arXiv:2402.14601 , year=

  8. [8]

    2025 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE) , pages=

    LLM-based Automated Grading with Human-in-the-Loop , author=. 2025 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE) , pages=. 2025 , organization=

  9. [9]

    Proceedings of the 15th international learning analytics and knowledge conference , pages=

    Automatic short answer grading in the LLM era: Does GPT-4 with prompt engineering beat traditional models? , author=. Proceedings of the 15th international learning analytics and knowledge conference , pages=

  10. [10]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    Evaluating the instruction-following robustness of large language models to prompt injection , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  11. [11]

    arXiv preprint arXiv:2504.05276 , year=

    Enhancing llm-based short answer grading with retrieval-augmented generation , author=. arXiv preprint arXiv:2504.05276 , year=

  12. [12]

    arXiv preprint arXiv:2603.00465 , year=

    Optimizing In-Context Demonstrations for LLM-based Automated Grading , author=. arXiv preprint arXiv:2603.00465 , year=

  13. [13]

    33rd USENIX Security Symposium (USENIX Security 24) , pages=

    Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

  14. [14]

    Advances in neural information processing systems , volume=

    Jailbroken: How does llm safety training fail? , author=. Advances in neural information processing systems , volume=

  15. [15]

    and Perez, Ethan and Grosse, Roger and Duvenaud, David , booktitle =

    Anil, Cem and Durmus, Esin and Panickssery, Nina and Sharma, Mrinank and Benton, Joe and Kundu, Sandipan and Batson, Joshua and Tong, Meg and Mu, Jesse and Ford, Daniel and Mosconi, Fracesco and Agrawal, Rajashree and Schaeffer, Rylan and Bashkansky, Naomi and Svenningsen, Samuel and Lambert, Mike and Radhakrishnan, Ansh and Denison, Carson and Hubinger, ...

  16. [16]

    European Symposium on Research in Computer Security , pages=

    Jatmo: Prompt injection defense by task-specific finetuning , author=. European Symposium on Research in Computer Security , pages=. 2024 , organization=

  17. [17]

    OpenAI GPT-5 System Card

    Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

  18. [18]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    PIGuard: Prompt injection guardrail via mitigating overdefense for free , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  19. [19]

    Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , pages=

    Secalign: Defending against prompt injection with preference optimization , author=. Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security , pages=

  20. [20]

    Nature Machine Intelligence , volume=

    Defending chatgpt against jailbreak attack via self-reminders , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

  21. [21]

    I understand why I got this grade

    " I understand why I got this grade": Automatic Short Answer Grading with Feedback , author=. arXiv preprint arXiv:2407.12818 , year=

  22. [22]

    Handbook of Research on Science Learning Progressions , year=

    SCHOOL LEVEL , author=. Handbook of Research on Science Learning Progressions , year=

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

  24. [24]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  25. [25]

    The Llama 3 Herd of Models

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  26. [26]

    2026 , month =

    Anthropic , title =. 2026 , month =

  27. [27]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

  28. [28]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , year=

  29. [29]

    doi:10.57967/hf/8254 , publisher =

    aifeifei , title =. doi:10.57967/hf/8254 , publisher =

  30. [30]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Llama guard: Llm-based input-output safeguard for human-ai conversations , author=. arXiv preprint arXiv:2312.06674 , year=

  31. [31]

    Towards llm-based autograd- ing for short textual answers,

    Towards llm-based autograding for short textual answers , author=. arXiv preprint arXiv:2309.11508 , year=

  32. [32]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=