GrandGuard supplies the first taxonomy, 10k-example benchmark, and fine-tuned safeguards targeting contextual safety failures unique to older adults using chatbots.
Artificial intelligence - carrying us into the future
Canonical reference. 80% of citing Pith papers cite this work as background.
Citation roles: background (5)
Representative citing papers
SafePyramid is a three-level benchmark showing frontier LLMs identify all violated rules in only 54.0%, 35.3%, and 12.9% of cases on L0, L1, and L2 respectively, indicating in-context policy guardrailing remains difficult.
FinRED creates an expert-validated benchmark and rubric for financial LLM safety that maps regulatory standards to specific threats and reduces critical false negatives in evaluation from 28 to 12.
Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
EvalCards is a composable reporting schema and monitoring tool for AI evaluations, derived from 52 papers and 10 interviews, and applied to 5,816 models and 101,843 results to surface reporting gaps.
A higher proportion of unsafe images in training data directly increases unsafe outputs in text-to-image models, independent of the absolute count; safer text encoders provide complementary risk reduction.
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
A retrieval approach identifies anomalous dimensions in a set of query vectors and retrieves database vectors that are anomalous across those dimensions, with performance improving as query set size grows to around 8.
Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to between 36.8% and 65% across 31 frontier models, indicating weak generalization in safety refusals.
LLM safety judges are inconsistent and disagree with each other, especially on regulated-domain advice like finance, while performing better on overt harms like violence; inconsistency also varies by language and style.
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
Generative AI boosts attackers' ability to create harmful content at scale while also enabling defenders to detect threats, support users, and improve moderation processes.
Ethnographic study of feminist civic-tech data work argues reparative AI dataset production requires resetting accountability ties to center those harmed by current practices.
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.