Healer uses LLMs to dynamically generate and execute runtime error-handling code, with GPT-4 recovering from 72.8% of errors across four datasets.
In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT.CoRR abs/2304.08979
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
People show little ability to distinguish AI-generated from human-written health advice in online communities, with detection varying by health condition and unreliable heuristics.
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
citing papers explorer
-
Towards Agentic Runtime Healing
Healer uses LLMs to dynamically generate and execute runtime error-handling code, with GPT-4 recovering from 72.8% of errors across four datasets.
-
Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models
PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.
-
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
-
Discerning Authorship in Online Health Communities: Experience, Trust, and Transparency Implications for Moderating AI
People show little ability to distinguish AI-generated from human-written health advice in online communities, with detection varying by health condition and unreliable heuristics.
-
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
-
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.