Canonical reference

Artificial intelligence - carrying us into the future

Ghosh, S · 2025 · arXiv 2503.05731

80% of citing Pith papers cite this work as background.

9 Pith papers citing it

Background: 80% of classified citations

citation-role summary

background: 5

citation-polarity summary

background: 4 · support: 1

representative citing papers

GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety

cs.HC · 2026-04-07 · conditional · novelty 8.0

GrandGuard supplies the first taxonomy, 10k-example benchmark, and fine-tuned safeguards targeting contextual safety failures unique to older adults using chatbots.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · novelty 7.0 · 2 refs

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

Retrieval with Multiple Query Vectors through Anomalous Pattern Detection

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

A retrieval approach identifies anomalous dimensions in a set of query vectors and retrieves database vectors that are anomalous across those dimensions, with performance improving as query set size grows to around 8.

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

How Generative AI Empowers Attackers and Defenders Across the Trust & Safety Landscape

cs.HC · 2025-11-10 · unverdicted · novelty 5.0

Generative AI boosts attackers' ability to create harmful content at scale while also enabling defenders to detect threats, support users, and improve moderation processes.

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

cs.CR · 2026-05-15 · unverdicted · novelty 3.0

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07

citing papers explorer

Showing 9 of 9 citing papers.

GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety

cs.HC · 2026-04-07 · conditional · none · ref 3

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · none · ref 36 · 2 links

StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

cs.CY · 2026-05-11 · accept · none · ref 43 · 2 links

Retrieval with Multiple Query Vectors through Anomalous Pattern Detection

cs.LG · 2026-05-03 · unverdicted · none · ref 51

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

cs.CL · 2026-04-20 · unverdicted · none · ref 44

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

cs.LG · 2026-05-07 · unverdicted · none · ref 21

How Generative AI Empowers Attackers and Defenders Across the Trust & Safety Landscape

cs.HC · 2025-11-10 · unverdicted · none · ref 34

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

cs.CR · 2026-05-15 · unverdicted · none · ref 45

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unreviewed · ref 4
