A behavioral framework operationalizes six dimensions of LLM reasoning quality and shows they are largely independent of accuracy, revealing the limits of single-metric evaluation.
Robustness in large language models: A survey of mitigation strategies and evaluation metrics
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 4
roles: background 1
polarities: background 1
representative citing papers
Introduces perturbation-based robustness evaluation and hybrid masking-adversarial training to reduce reliance on spurious topical cues while preserving methodological signals in biomedical publication type classification.
Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.
Debiasing via fine-tuning can enhance LLM robustness to semantically neutral prompt perturbations by addressing perturbation-induced bias in neural network outputs.
citing papers explorer
- Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts