CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2verdicts
UNVERDICTED 2representative citing papers
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
citing papers explorer
-
Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
-
ShieldGemma: Generative AI Content Moderation Based on Gemma
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.