Adversarial SQL Injection Generation with LLM-Based Architectures
Pith reviewed 2026-05-13 02:07 UTC · model grok-4.3
The pith
LLM-based systems generate SQL injection payloads that bypass web application firewalls at overall rates up to 22.73 percent, and at over 90 percent against some AI/ML-based WAFs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce RADAGAS, a retrieval-augmented generation system, and RefleXQLi, a reflective chain-of-thought approach, for producing adversarial SQLi payloads. Across 240 experiments generating 240,000 payloads, RADAGAS-GPT4o achieves a 22.73 percent overall bypass rate and outperforms the other baselines. The methods reach high success rates on AI/ML-based WAFs yet remain largely ineffective against rule-based ones, and less diverse payload sets sometimes produce more bypasses, provided the starting payload succeeds.
What carries the argument
RADAGAS, a retrieval-augmented generation system that pulls relevant examples to steer large language models toward creating effective adversarial SQL injection payloads.
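The retrieval step can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the paper's reference list points to embedding-based retrieval (Sentence-BERT, FAISS), whereas this sketch substitutes a cheap character-trigram Jaccard similarity, and `build_prompt` is a hypothetical stand-in for the unpublished prompt template.

```python
# Illustrative sketch of a RADAGAS-style retrieval-augmented generation step.
# Assumptions: trigram Jaccard stands in for the paper's embedding retriever,
# and build_prompt is a hypothetical helper, not the authors' template.

def trigrams(s):
    """Character trigrams of a payload, a cheap similarity proxy."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard similarity over character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(seed, corpus, k=3):
    """Return the k corpus payloads most similar to the seed."""
    return sorted(corpus, key=lambda p: similarity(seed, p), reverse=True)[:k]

def build_prompt(seed, examples):
    """Hypothetical prompt assembly; the real template is not public."""
    shots = "\n".join(f"- {e}" for e in examples)
    return (f"Known payloads similar to the target:\n{shots}\n"
            f"Mutate this payload to evade a WAF: {seed}")

corpus = [
    "' OR 1=1 --",
    "' UNION SELECT username, password FROM users --",
    "admin'/**/OR/**/1=1#",
    "1; DROP TABLE users",
]
examples = retrieve("' OR 2=2 --", corpus)
prompt = build_prompt("' OR 2=2 --", examples)
```

The retrieved near-neighbors steer the model toward payload families that have worked before, which is the mechanism the bypass-rate claims rest on.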
If this is right
- RADAGAS-DeepSeek reaches 92.49 percent bypass on the AI-based WAF Brain.
- RADAGAS-Claude reaches 80.48 percent bypass on the AI-based CNN-WAF.
- Bypass rates on rule-based WAFs such as ModSecurity and Coraza remain between 0 and 5.70 percent.
- Less diverse payload sets can increase bypass counts but perform poorly when the initial payload fails.
Where Pith is reading between the lines
- Security teams could add LLM generators to routine testing suites to simulate advanced attacks against current defenses.
- Rule-based WAF developers may need to add dynamic pattern matching to counter the outputs of these LLM methods.
- Hybrid systems that combine LLM generation with traditional fuzzing could reduce the risk that a single weak starting payload blocks all further attempts.
Load-bearing premise
The 240 experiments with chosen prompts and WAF configurations accurately reflect real-world adversarial conditions without introducing bias from those choices.
What would settle it
Applying the generated payloads against additional live web applications protected by the tested WAFs and measuring whether the reported bypass rates hold under actual traffic.
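A minimal harness for that live replay might look like the following. The HTTP layer is stubbed out, and the classification rule is an assumption on our part (a blocked request returning HTTP 403 is typical for default ModSecurity-style deployments, but not stated by the paper).

```python
# Sketch of replaying generated payloads against a WAF-protected endpoint
# and measuring the bypass rate. Assumption: HTTP 403 means the WAF blocked
# the request; any other status means the payload reached the application.

def is_bypass(status_code):
    """Treat any non-403 response as a bypass (assumed blocking status)."""
    return status_code != 403

def bypass_rate(results):
    """results: list of (payload, status_code) pairs from live replay."""
    if not results:
        return 0.0
    bypasses = sum(1 for _, status in results if is_bypass(status))
    return 100.0 * bypasses / len(results)

# Example with recorded responses in place of live traffic:
recorded = [
    ("' OR 1=1 --", 403),           # blocked by the WAF
    ("admin'/**/OR/**/1=1#", 200),  # passed through to the application
    ("1; DROP TABLE users", 403),   # blocked by the WAF
]
rate = bypass_rate(recorded)  # 33.33... percent
```

Running this against production-like traffic, rather than a bare WAF instance, is what would test whether the reported rates generalize.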
Original abstract
SQL injection (SQLi) attacks are still one of the serious attacks ranked in the Open Worldwide Application Security Project (OWASP) Top 10 threats. Today, with advances in Artificial Intelligence (AI), especially in Large Language Models (LLMs), an opportunity has been created for automating adversarial attack tests to measure the defense mechanisms. In this paper, we aim to create a comprehensive evaluation of use cases that utilize LLMs for adversarial SQL injection generation. We introduce two novel LLM-based systems, Retrieval Augmented Generation for Adversarial SQLi (RADAGAS) and Reflective Chain-of-Thought SQLi (RefleXQLi), and compare them with existing baselines against 10 Web Application Firewalls (WAFs) and one execution-based MySQL validator. To perform a comprehensive test, we used six rule-based open-source WAFs (ModSecurity PL1–3, Coraza PL1–3), 2 AI/ML-based WAFs (WAF Brain, CNN-WAF), and 2 commercial WAFs (AWS WAF and Cloudflare WAF). For the LLM models, we used GPT-4o, Claude 3.7 Sonnet, and DeepSeek R1. Our tests consist of 240 experiments that generate 240,000 payloads and perform 2.2 million tests against WAFs. Our comprehensive evaluation reveals that RADAGAS-GPT4o outperforms other baseline models with a 22.73% bypass rate. The proposed RADAGAS variants are highly successful on AI/ML-based WAFs (92.49% on WAF-Brain by RADAGAS-DeepSeek, 80.48% on CNN-WAF by RADAGAS-Claude), but struggle to bypass rule-based WAFs (0–5.70% on ModSecurity and Coraza). In addition to these findings, another observation is that creating less diverse payloads achieves more bypasses, however they show poor results if the initially chosen payload is not successful. We observe that our findings provide a comprehensive view on using LLM-based approaches in security testing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces two novel LLM-based systems, RADAGAS (Retrieval Augmented Generation for Adversarial SQLi) and RefleXQLi (Reflective Chain-of-Thought SQLi), for generating adversarial SQL injection payloads. It evaluates these against baselines (using GPT-4o, Claude 3.7 Sonnet, and DeepSeek R1) on 10 WAFs (six rule-based open-source, two AI/ML-based, and two commercial) plus a MySQL validator via 240 experiments that produce 240,000 payloads and 2.2 million tests. The central claim is that RADAGAS-GPT4o achieves the highest bypass rate of 22.73%, with strong results on AI/ML WAFs (e.g., 92.49% on WAF-Brain) but low success on rule-based WAFs (0-5.70%), plus the observation that less diverse payloads yield higher bypass rates when the initial payload succeeds.
Significance. If the bypass rates hold, the work supplies a large-scale empirical benchmark on LLM-driven adversarial SQLi generation, quantifying the relative strengths of retrieval-augmented and reflective techniques against different WAF classes. The 2.2 million test scale and coverage of open-source, ML-based, and commercial defenses provide concrete data that could inform both offensive security tooling and defensive WAF hardening, particularly highlighting the comparative weakness of rule-based systems.
major comments (3)
- [Evaluation] Evaluation section: The exact prompt templates, retrieval corpus, and chain-of-thought instructions for RADAGAS and RefleXQLi are not specified. Because the 22.73% bypass claim for RADAGAS-GPT4o and the outperformance over baselines rest on these custom generation procedures, and the skeptic note indicates sensitivity to prompt variations, the absence of reproducible prompt details prevents verification that the reported advantage derives from the architectures rather than hyperparameter or prompt tuning.
- [Results] Results section: Bypass rates (e.g., 22.73% overall for RADAGAS-GPT4o, 92.49% on WAF-Brain) are presented as point estimates without error bars, confidence intervals, or statistical significance tests across the 240 experiments. This omission is load-bearing for the comparative claim, as variance from payload sampling or WAF response stochasticity could alter whether the observed outperformance is reliable.
- [WAF testing setup] WAF testing setup: The paper does not state whether the 10 WAFs were evaluated under default configurations or with any custom rules, thresholds, or versions. Given that bypass performance is known to be sensitive to WAF rule sets (especially for ModSecurity PL1-3 and Coraza), this detail is required to interpret the low rule-based bypass rates (0-5.70%) versus high AI/ML rates as generalizable rather than configuration-specific.
minor comments (2)
- [Discussion] The note that less diverse payloads achieve more bypasses would be strengthened by quantitative diversity metrics (e.g., number of unique payloads or lexical entropy) rather than a qualitative observation.
- [Results] A table summarizing per-WAF and per-model bypass rates with exact experiment counts would improve readability of the 2.2 million test results.
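The diversity metrics the first minor comment asks for could be computed with two simple stdlib measures: the unique-payload ratio and the character-level Shannon entropy of a payload set. The metric choice here is ours; the paper does not specify how diversity was assessed.

```python
# Two candidate diversity metrics for a generated payload set. These are
# illustrative stand-ins, not the paper's (unspecified) diversity measure.
import math
from collections import Counter

def unique_ratio(payloads):
    """Fraction of distinct payloads in the generated set."""
    return len(set(payloads)) / len(payloads)

def char_entropy(payloads):
    """Shannon entropy (bits/char) of the pooled character distribution."""
    counts = Counter("".join(payloads))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

low_diversity = ["' OR 1=1 --"] * 4
high_diversity = ["' OR 1=1 --", "1; DROP TABLE t", "' UNION SELECT 1 --", "x'#"]
assert unique_ratio(low_diversity) < unique_ratio(high_diversity)
assert char_entropy(low_diversity) < char_entropy(high_diversity)
```

Reporting either metric alongside bypass counts would turn the "less diverse payloads bypass more" observation into a testable correlation.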
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important aspects of reproducibility, statistical rigor, and experimental clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims or findings.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: The exact prompt templates, retrieval corpus, and chain-of-thought instructions for RADAGAS and RefleXQLi are not specified. Because the 22.73% bypass claim for RADAGAS-GPT4o and the outperformance over baselines rest on these custom generation procedures, and the skeptic note indicates sensitivity to prompt variations, the absence of reproducible prompt details prevents verification that the reported advantage derives from the architectures rather than hyperparameter or prompt tuning.
  Authors: We agree that full reproducibility requires the exact prompt templates, retrieval corpus composition, and chain-of-thought instructions. The manuscript describes the high-level architectures of RADAGAS and RefleXQLi in the methods section, but does not include the verbatim prompts or corpus details. In the revised version, we will add a dedicated appendix containing the complete prompt templates for each LLM, the structure and size of the retrieval corpus, and the precise reflective CoT instructions. This will enable independent verification that performance differences arise from the proposed techniques. revision: yes
- Referee: [Results] Results section: Bypass rates (e.g., 22.73% overall for RADAGAS-GPT4o, 92.49% on WAF-Brain) are presented as point estimates without error bars, confidence intervals, or statistical significance tests across the 240 experiments. This omission is load-bearing for the comparative claim, as variance from payload sampling or WAF response stochasticity could alter whether the observed outperformance is reliable.
  Authors: The referee correctly notes that only point estimates are reported. Although the scale of 240 experiments and 2.2 million tests provides substantial empirical support, we did not include error bars, confidence intervals, or formal statistical tests. In the revision, we will compute and report 95% confidence intervals for the bypass rates and add a brief discussion of observed variance across runs. Where direct comparisons are made, we will include appropriate statistical significance indicators to substantiate the outperformance claims. revision: yes
- Referee: [WAF testing setup] WAF testing setup: The paper does not state whether the 10 WAFs were evaluated under default configurations or with any custom rules, thresholds, or versions. Given that bypass performance is known to be sensitive to WAF rule sets (especially for ModSecurity PL1–3 and Coraza), this detail is required to interpret the low rule-based bypass rates (0–5.70%) versus high AI/ML rates as generalizable rather than configuration-specific.
  Authors: We used the default configurations for every WAF: standard ModSecurity and Coraza installations at the stated paranoia levels (PL1–3) with no custom rules added, and the AI/ML-based and commercial WAFs in their out-of-the-box settings. This information was omitted from the manuscript. In the revised evaluation section, we will explicitly state that all tests used default configurations and versions, thereby clarifying that the reported differences between rule-based and AI/ML WAFs reflect standard deployments. revision: yes
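The 95% confidence intervals the authors commit to in the statistics response can be obtained with a Wilson score interval, which stays well-behaved even at bypass rates near 0 percent (the rule-based WAF regime). The counts below are illustrative, not figures from the paper.

```python
# Wilson score interval for a binomial bypass rate. A sketch of the
# statistical fix promised in the rebuttal; counts here are illustrative.
import math

def wilson_ci(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

lo, hi = wilson_ci(227, 1000)  # e.g., a 22.7% observed bypass rate
```

Unlike the normal-approximation interval, the Wilson interval never dips below zero, which matters for the 0–5.70% rule-based results.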
Circularity Check
Empirical benchmarking with no derivation chain or fitted predictions
full rationale
The paper conducts a direct empirical evaluation of LLM-based SQLi payload generators (RADAGAS and RefleXQLi) against 10 WAFs and a MySQL validator using 240 experiments that produce 240,000 payloads and 2.2 million tests. No equations, parameters fitted to subsets then renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the reported methodology or results. Bypass rates (e.g., 22.73% for RADAGAS-GPT4o) are measured outcomes from external system interactions, not derived quantities that reduce to the inputs by construction. The study is self-contained against its chosen benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] OWASP Foundation: OWASP Top 10:2021: The Ten Most Critical Web Application Security Risks (2021). https://owasp.org/Top10/
- [2] Damele, B., Stampar, M.: SQLMAP: Automatic SQL Injection and Database Takeover Tool. Open-source penetration testing tool (2006). https://sqlmap.org/
- [3] OpenAI: GPT-4 Technical Report (2024). https://arxiv.org/abs/2303.08774
- [4] Anthropic-AI: The Claude 3 model family: Opus, Sonnet, Haiku. Claude-3 Model Card 1(1), 4 (2024)
- [5] DeepSeek-AI: DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (2024). https://arxiv.org/abs/2406.11931
- [6] Manes, V.J., Han, H., Han, C., Cha, S.K., Egele, M., Schwartz, E.J., Woo, M.: The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering 47(11), 2312–2331 (2019)
- [7] OWASP: ModSecurity: Open Source Web Application Firewall (2024). https://modsecurity.org/
- [8] Babaey, V., Ravindran, A.: GenSQLi: A generative artificial intelligence framework for automatically securing web application firewalls against structured query language injection attacks. Future Internet 17(1) (2025). https://doi.org/10.3390/fi17010008
- [9] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., Sun, X., Li, L., Sui, Z.: A survey on in-context learning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1107–. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.emnlp-main.64
- [11] Kindy, D.A., Pathan, A.-S.K.: A survey on SQL injection: Vulnerabilities, attacks, and prevention techniques. In: IEEE International Symposium on Consumer Electronics (ISCE), pp. 468–471 (2011). https://doi.org/10.1109/ISCI.2011.5973873
- [12] Clarke, J.: SQL Injection Attacks and Defense. Elsevier, Waltham, MA (2012)
- [13] Appelt, D., Nguyen, C.D., Briand, L.: Behind an application firewall, are we safe from SQL injection attacks? In: IEEE International Conference on Software Testing, Verification and Validation (ICST), pp. 1–10 (2015)
- [14] Pan, Y., Sun, F., Teng, Z., White, J., Schmidt, D.C., Staples, J., Krause, L.: Detecting web attacks with end-to-end deep learning. Journal of Internet Services and Applications 10(1), 1–22 (2019)
- [15] BBVA-Labs: WAF-Brain: Machine Learning Based Web Application Firewall (2018). https://github.com/BBVA/waf-brain
- [16] Gui, Z., Wang, E., Deng, B., Zhang, M., Chen, Y., Wei, S., Xie, W., Wang, B.: SqliGPT: Evaluating and utilizing large language models for automated SQL injection black-box detection. Applied Sciences 14(16) (2024). https://doi.org/10.3390/app14166929
- [17] Yang, T., Jiang, Z., Wang, Y.: LLMSQLi: A black-box web SQLi detection tool based on large language model. In: IEEE International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), pp. 629–633 (2024)
- [18] Sun, Y., Wu, D., Xue, Y., Liu, H., Ma, W., Zhang, L., Liu, Y., Li, Y.: LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning (2025). https://arxiv.org/abs/2401.16185
- [19] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In: USENIX Security Symposium (USENIX Security 24), pp. 847–864 (2024). https://www.usenix.org/conference/usenixsecurity24/presentation/deng
- [20] Liu, Y., Deng, G., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y., Wang, H., Zheng, Y., Zhang, L.Y., Liu, Y.: Prompt Injection attack against LLM-integrated Applications (2025). https://arxiv.org/abs/2306.05499
- [21] Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y.: The Curious Case of Neural Text Degeneration (2020). https://arxiv.org/abs/1904.09751
- [22] Meister, C., Pimentel, T., Wiher, G., Cotterell, R.: Locally typical sampling. Transactions of the Association for Computational Linguistics 11, 102–121 (2023). https://doi.org/10.1162/tacl_a_00536
- [23] Renze, M.: The effect of sampling temperature on problem solving in large language models. In: Findings of the Association for Computational Linguistics (EMNLP), pp. 7346–7356. Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.findings-emnlp.432
- [24] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.... In: Advances in Neural Information Processing Systems
- [25] Wahaibi, S.A., Foley, M., Maffeis, S.: SQIRL: Grey-Box detection of SQL injection vulnerabilities using reinforcement learning. In: USENIX Security Symposium (USENIX Security), pp. 6097–6114. USENIX Association, Anaheim, CA (2023). https://www.usenix.org/conference/usenixsecurity23/presentation/al-wahaibi
- [26] Hu, Z., Beuran, R., Tan, Y.: Automated penetration testing using deep reinforcement learning. In: IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 2–10 (2020)
- [27] Zhong, R., Chen, Y., Hu, H., Zhang, H., Lee, W., Wu, D.: SQUIRREL: Testing Database Management Systems with Language Validity and Coverage Feedback. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, pp. 955–970 (2020). https://doi.org/10.1145/3372297.3417260
- [28] Zalewski, M.: American fuzzy lop (AFL) fuzzer (2017). https://lcamtuf.coredump.cx/afl
- [29] Lemieux, C., Sen, K.: Fairfuzz: a targeted mutation strategy for increasing greybox fuzz testing coverage. In: Proceedings of the ACM/IEEE International Conference on Automated Software Engineering, New York, NY, USA, pp. 475–485 (2018). https://doi.org/10.1145/3238147.3238176
- [30] Böhme, M., Pham, V.-T., Roychoudhury, A.: Coverage-based greybox fuzzing as markov chain. In: Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, pp. 1032–1043 (2016). https://doi.org/10.1145/2976749.2978428
- [31] Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336. Association for Computing Machinery, New York, NY, USA (1998). https://doi.org/10.1145/290941.291025
- [32] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014). https://proceedings.neurips.cc/paper_files/paper/2014/file/f033ed80deb0234979a61f95710dbe25-Paper.pdf
- [33] Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In: Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837 (2022). https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-C...
- [34] PortSwigger: Web Security Academy: SQL Injection (2024). https://portswigger.net/web-security/sql-injection
- [35] swisskyrepo, contributors: PayloadsAllTheThings: A List of Useful Payloads and Bypass for Web Application Security. Community-maintained security payload repository (2024). https://github.com/swisskyrepo/PayloadsAllTheThings
- [36] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410
- [37] Muennighoff, N., Tazi, N., Magne, L., Reimers, N.: MTEB: Massive Text Embedding Benchmark. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, pp. 2014–2037 (2023). https://doi.org/10.18653/v1/2023.eacl-main.148
- [38] Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7(3), 535–547 (2019)
- [39] Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707–710 (1966)
- [40] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT (2020). https://arxiv.org/abs/1904.09675
- [41]
- [42] Pearson, K.: Notes on the History of Correlation. Biometrika 13(1), 25–45 (1920)
- [43] Fisher, R.A.: Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10(4), 507–521 (1915)
- [44] Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Routledge, New York (2013)
- [45] Razzaq, A., Latif, K., Ahmad, H.F., Hur, A., Anwar, Z., Bloodsworth, P.C.: Semantic security against web application attacks. Information Sciences 254, 19–38 (2014). https://doi.org/10.1016/j.ins.2013.08.007