pith. sign in

arxiv: 2602.00979 · v2 · pith:UXVMTN4Snew · submitted 2026-02-01 · 💻 cs.CR · cs.AI· cs.CL

GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

Pith reviewed 2026-05-25 06:59 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL
keywords adversarial attackLLM securityeducational agentsautomatic gradingshort answer gradingprompt manipulationtoken-level attack
0
0 comments X

The pith

Adversarial prompt and token changes can alter LLM grading outcomes with high success and stealth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GradingAttack to demonstrate how LLM-based agents for short-answer grading can be manipulated. It develops token-level and prompt-level strategies that change the grades assigned while keeping the input changes difficult to notice. Experiments across multiple datasets show both strategies succeed in compromising the agents, with prompt-level versions succeeding more often and token-level versions proving stealthier. This matters because such agents are already used to assess student work in educational settings, where manipulated grades could affect fairness and trust. The findings indicate that these systems currently have no strong built-in protections against targeted interference.

Core claim

GradingAttack is a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM based educational grading agents. It designs token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. This reveals that current LLM based educational agents lack robust defenses against adversarial attacks.

What carries the argument

GradingAttack framework with token-level and prompt-level attack strategies that manipulate grading outcomes while maintaining high stealth.

If this is right

  • LLM grading agents can have their outputs changed by adversarial inputs.
  • Prompt-level attacks tend to succeed more frequently at altering grades.
  • Token-level attacks are harder for observers to detect.
  • Educational LLM agents require additional security measures to be trustworthy.
  • Automated grading carries risks from undetected manipulation in real use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could add adversarial testing using these attack types before releasing grading agents.
  • Similar manipulation risks may apply to other LLM agents making assessment or feedback decisions.
  • Detection methods focused on input pattern anomalies could reduce success of these attacks.
  • Defenses might need to be specific to short-answer grading rather than general LLM security.

Load-bearing premise

The tested LLM grading agents and datasets are representative of real-world educational deployments, and the attack success rates will hold when the agents are used in live classroom settings rather than controlled experiments.

What would settle it

A live deployment of an LLM grading agent in an actual classroom where attempted prompt-level and token-level attacks fail to change grades or are reliably detected by standard review processes.

Figures

Figures reproduced from arXiv: 2602.00979 by Xueyi Li, Yongdong Wu, Zhuoneng Zhou, Zitao Liu.

Figure 1
Figure 1. Figure 1: Illustration of an attack on an LLM based ASAG model, demonstrating successful manip [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overview of our GradingAttack framework. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparisons of token-level (dash line) and prompt-level attack methods on [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The impact of attack on different labels. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of role-play string placement on performance. R, S, and P represent the role-play strings, student answer and grading prompt, re￾spectively, with their order indicating the relative placement. To further investigate the impact of role-play string placement on attack effectiveness in our GradingAttackRole method, we conduct exper￾iments by varying the position of role-play strings in the adversarial … view at source ↗
Figure 6
Figure 6. Figure 6: An example prompt for LLM based ASAG tasks. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A complete grading process. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
read the original abstract

Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate ``in the wild'', their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. Our findings reveal that current LLM based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need for developing secure and trustworthy agent systems for critical educational applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GradingAttack, a fine-grained adversarial attack framework for LLM-based automatic short answer grading (ASAG) agents. It proposes token-level and prompt-level attack strategies that manipulate grading outcomes while aiming for high stealth, and claims that experiments on multiple datasets show both strategies effectively compromise the agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth.

Significance. If the empirical results hold with proper documentation, the work is significant for highlighting security risks in deployed educational AI systems, potentially motivating defenses for fairness and trustworthiness in automated assessment. The empirical attack framework, if reproducible with clear metrics and baselines, would contribute to the growing literature on LLM vulnerabilities in critical applications.

major comments (2)
  1. [Abstract] Abstract: the claim that 'experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents' provides no details on the LLMs tested, the datasets, the definition or computation of attack success rates, baselines, or any statistical significance tests. This absence makes it impossible to assess whether the data support the central claim.
  2. [Abstract] Abstract: the reported success rates rest on the untested assumption that the specific grading agents (LLMs + prompts) are representative; there is no indication that experiments varied system prompts, incorporated chain-of-thought reasoning, few-shot examples, or fine-tuned models, which could sharply reduce attack effectiveness in real deployments.
minor comments (1)
  1. [Abstract] The abstract refers to 'high stealth' and 'superior stealth capability' without defining the metric (e.g., detection rate by humans or other LLMs) or how it was measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and outline planned revisions to strengthen the presentation of our experimental claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents' provides no details on the LLMs tested, the datasets, the definition or computation of attack success rates, baselines, or any statistical significance tests. This absence makes it impossible to assess whether the data support the central claim.

    Authors: We agree that the abstract, as a high-level summary, omits these specifics. The manuscript body (Sections 3.2, 4.1, and 5) details the LLMs evaluated (GPT-3.5-Turbo, GPT-4, Llama-2-7B), the datasets (SciEntsBank, Beetle, and two additional ASAG corpora), the attack success rate metric (fraction of responses whose assigned grade is altered to the attacker-chosen target), the baseline comparisons (random token replacement and prompt paraphrasing), and the use of paired t-tests for significance. To improve accessibility, we will revise the abstract to include a concise clause summarizing the LLMs, datasets, and primary success-rate definition. revision: yes

  2. Referee: [Abstract] Abstract: the reported success rates rest on the untested assumption that the specific grading agents (LLMs + prompts) are representative; there is no indication that experiments varied system prompts, incorporated chain-of-thought reasoning, few-shot examples, or fine-tuned models, which could sharply reduce attack effectiveness in real deployments.

    Authors: This is a fair observation. Our experiments used standard zero-shot system prompts with the listed off-the-shelf models; we did not ablate system-prompt wording, add chain-of-thought, few-shot exemplars, or evaluate fine-tuned graders. We will add an explicit limitations paragraph acknowledging that more elaborate prompting or fine-tuning could increase robustness, and we will frame the current results as evidence of vulnerabilities in typical current deployments rather than claiming universal representativeness. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation with no derivations or self-referential reductions

full rationale

The paper introduces GradingAttack as an empirical adversarial framework consisting of token-level and prompt-level strategies, evaluated via experiments on multiple datasets. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described structure. The central claims rest on reported attack success rates from controlled experiments rather than any self-definitional or load-bearing reduction to inputs. This matches the default expectation for non-circular empirical security papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the attack framework name itself.

pith-pipeline@v0.9.0 · 5695 in / 1052 out tokens · 24530 ms · 2026-05-25T06:59:09.081668+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

  1. [1]

    Rama Sree, and M

    Sridevi Bonthu, S. Rama Sree, and M. H. M. Krishna Prasad. Automated short answer grading using deep learning: A survey. InProceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction, Virtual Event, August 2021

  2. [2]

    The eras and trends of automatic short answer grading

    Steven Burrows, Iryna Gurevych, and Benno Stein. The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25:60–117, 2015

  3. [3]

    Automatic short answer grading for finnish with chatgpt

    Li-Hsin Chang and Filip Ginter. Automatic short answer grading for finnish with chatgpt. InProceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, March 2024

  4. [4]

    Using large language models for automated grading of student writing about science.International Journal of Artificial Intelligence in Education, pages 1–35, 2025

    Impey Chris, Wenger Matthew, Garuda Nikhil, Golchin Shahriar, and Stamer Sarah. Using large language models for automated grading of student writing about science.International Journal of Artificial Intelligence in Education, pages 1–35, 2025

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  6. [6]

    Security and privacy challenges of large language models: A survey.ACM Computing Surveys, pages 1–34, 2025

    Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey.ACM Computing Surveys, pages 1–34, 2025

  7. [7]

    nswvt- nvakgxpm

    Yuning Ding, Brian Riordan, Andrea Horbach, Aoife Cahill, and Torsten Zesch. Don’t take “nswvt- nvakgxpm” for an answer–the surprising vulnerability of automatic content scoring systems to adversarial input. InProceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, December 2020

  8. [8]

    SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge

    Myroslava Dzikovska, Rodney Nielsen, Chris Brew, Claudia Leacock, Danilo Giampiccolo, Luisa Ben- tivogli, Peter Clark, Ido Dagan, and Hoa Trang Dang. SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. InProceedings of the 7th International Workshop on Semantic Evaluation, Atlanta, Georgia, USA, June 2013

  9. [9]

    Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs.International Journal of Artificial Intelligence in Education, 34:616–646, 2024

    Anna Filighera, Sebastian Ochs, Tim Steuer, and Thomas Tregel. Cheating automatic short answer grading with the adversarial usage of adjectives and adverbs.International Journal of Artificial Intelligence in Education, 34:616–646, 2024

  10. [10]

    Fooling automatic short answer grading systems

    Anna Filighera, Tim Steuer, and Christoph Rensing. Fooling automatic short answer grading systems. In Proceedings of the 21st International Conference on Artificial Intelligence in Education, Ifrane, Morocco, July 2020. 10

  11. [11]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. InProceedings of 34th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Virtual Event, December 2021

  12. [12]

    Cheating during the college years: How do business school students compare?Journal of Business Ethics, 72:197–206, 2007

    Helen A Klein, Nancy M Levenburg, Marie McKendall, and William Mothersell. Cheating during the college years: How do business school students compare?Journal of Business Ethics, 72:197–206, 2007

  13. [13]

    A multilingual dataset of adversarial attacks to automatic content scoring systems

    Ronja Laarmann-Quante, Christopher Chandler, Noemi Incirkus, Vitaliia Ruban, Alona Solopov, and Luca Steen. A multilingual dataset of adversarial attacks to automatic content scoring systems. InProceedings of the 20th Conference on Natural Language Processing, Vienna, Austria, September 2024

  14. [14]

    Mwptoolkit: An open-source framework for deep learning-based math word problem solvers

    Yihuai Lan, Lei Wang, Qiyuan Zhang, Yunshi Lan, Bing Tian Dai, Yan Wang, Dongxiang Zhang, and Ee-Peng Lim. Mwptoolkit: An open-source framework for deep learning-based math word problem solvers. InProceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual Event, February 2022

  15. [15]

    Advancing adversarial suffix transfer learning on aligned large language models

    Hongfu Liu, Yuxi Xie, Ye Wang, and Michael Shieh. Advancing adversarial suffix transfer learning on aligned large language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024

  16. [16]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. InProceedings of the 12th International Conference on Learning Representations, Vienna, Austria, May 2024

  17. [17]

    Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. Jailbreaking chatgpt via prompt engineering: An empirical study.arXiv preprint arXiv:2305.13860, 2023

  18. [18]

    Comuniqa : Exploring large language models for improving speaking skills

    Manas Mhasakar, Shikhar Sharma, Apurv Mehra, Utkarsh Venaik, Ujjwal Singhal, Dhruv Kumar, and Kashish Mittal. Comuniqa : Exploring large language models for improving speaking skills. InProceedings of the 7th ACM SIGCAS/SIGCHI Conference of Computing and Sustainable Societies, New Delhi, India, July 2024

  19. [19]

    A survey on deep learning-based automated essay scoring and feedback generation.Artificial Intelligence Review, 58:1–40, 2025

    Haile Misgna, Byung-Won On, Ingyu Lee, and Gyu Sang Choi. A survey on deep learning-based automated essay scoring and feedback generation.Artificial Intelligence Review, 58:1–40, 2025

  20. [20]

    Autotutor meets large language models: A language model tutor with rich pedagogy and guardrails

    Sankalan Pal Chowdhury, Vilém Zouhar, and Mrinmaya Sachan. Autotutor meets large language models: A language model tutor with rich pedagogy and guardrails. InProceedings of the 11th ACM Conference on Learning @ Scale, New York, NY , USA, July 2024

  21. [21]

    Embeddings for automatic short answer grading: A scoping review.IEEE Transactions on Learning Technologies, 16:219–231, 2023

    Marko Putnikovic and Jelena Jovanovic. Embeddings for automatic short answer grading: A scoping review.IEEE Transactions on Learning Technologies, 16:219–231, 2023

  22. [22]

    Abscribe: Rapid exploration & organization of multiple writing variations in human-ai co-writing tasks using large language models

    Mohi Reza, Nathan M Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan “Michael” Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, and Joseph Jay Williams. Abscribe: Rapid exploration & organization of multiple writing variations in human-ai co-writing tasks using large language models. InProceedings of the 2024 CHI Conference on Human ...

  23. [23]

    Enhancing short answer grading with openai apis

    Sebastian Speiser and Annegret Weng. Enhancing short answer grading with openai apis. InProceedings of the 21st International Conference on Information Technology Based Higher Education and Training, Paris, France, November 2024

  24. [24]

    Unimodal regularisation based on beta distribution for deep ordinal regression.Pattern Recognition, 122:1–10, February 2022

    Pedro Antonio Gutiérrez Víctor Manuel Vargas and César Hervás-Martínez. Unimodal regularisation based on beta distribution for deep ordinal regression.Pattern Recognition, 122:1–10, February 2022

  25. [25]

    Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387, 2023

    Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv preprint arXiv:2310.06387, 2023

  26. [26]

    Factors associated with cheating among college students: A review.Research in Higher Education, 39:235–274, 1998

    Bernard E Whitley. Factors associated with cheating among college students: A review.Research in Higher Education, 39:235–274, 1998

  27. [27]

    An llm can fool itself: A prompt-based adversarial attack.arXiv preprint arXiv:2310.13345, 2023

    Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, and Mohan Kankanhalli. An llm can fool itself: A prompt-based adversarial attack.arXiv preprint arXiv:2310.13345, 2023

  28. [28]

    Evaluating the Performance of Large Language Models on GAOKAO Benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on gaokao benchmark.arXiv preprint arXiv:2305.12474, 2023. 11

  29. [29]

    Boosting jailbreak attack with momentum

    Yihao Zhang and Zeming Wei. Boosting jailbreak attack with momentum. InProceedings of the ICLR 2024 Workshop on Reliable and Responsible Foundation Models, Vienna, Austria, May 2024

  30. [30]

    A survey of recent backdoor attacks and defenses in large language models.Transactions on Machine Learning Research, pages 1–28, 2025

    Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, XIAOYU XU, Xiaobao Wu, Jie Fu, Feng Yichao, Fengjun Pan, and Anh Tuan Luu. A survey of recent backdoor attacks and defenses in large language models.Transactions on Machine Learning Research, pages 1–28, 2025

  31. [31]

    Universal vulnerabilities in large language models: Backdoor attacks for in-context learning

    Shuai Zhao, Meihuizi Jia, Luu Anh Tuan, Fengjun Pan, and Jinming Wen. Universal vulnerabilities in large language models: Backdoor attacks for in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, November 2024

  32. [32]

    Can llm replace stack overflow? a study on robustness and reliability of large language model code generation

    Li Zhong and Zilong Wang. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, February 2024

  33. [33]

    Virtual context enhancing jailbreak attacks with special token injection

    Yuqi Zhou, Lin Lu, Ryan Sun, Pan Zhou, and Lichao Sun. Virtual context enhancing jailbreak attacks with special token injection. InFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 2024

  34. [34]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 12 A Effectiveness of CAS Metric Evaluating adversarial attack performance often relies solely on the ASR. However, this metric alone is insufficient to ref...