Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
An llm can fool itself: A prompt-based adversarial attack
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.
PQR framework generates diverse realistic queries to elicit QA agent failures, uncovering 23-78% more unhelpful responses than prior methods in e-commerce agent tests.
Presents GradingAttack with token- and prompt-level adversarial attacks that compromise LLM educational grading agents on multiple datasets, showing prompt-level attacks succeed more while token-level are stealthier.
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
citing papers explorer
-
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.