Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, et al. · 2024 · DOI 10.52202/079017-1745

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

Fanfiction subgenres from AO3 function as universal register-based jailbreaks, raising mean attack success rate from 0.278 to 0.731 across eight aligned LLMs on HarmBench and JailbreakBench.

Quality Is Not a Safety Proxy Under Quantization

cs.LG · 2026-06-08 · conditional · novelty 6.0

Across 51 quantized checkpoints, quality metrics fail to predict safety drops in 36 pairings and 10 hidden-danger cases, while a new RTSI screen routes all 10 dangerous rows to testing at matched bucket size.

Silent Failures in Federated Personalization of Foundation Models

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Federated personalization of foundation models creates hard-to-detect trustworthiness failures due to privacy constraints, and existing benchmarks cannot adequately evaluate them.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs · cs.CL · 2026-06-03 · unverdicted · none · ref 4
Quality Is Not a Safety Proxy Under Quantization · cs.LG · 2026-06-08 · conditional · none · ref 3
Silent Failures in Federated Personalization of Foundation Models · cs.LG · 2026-05-31 · unverdicted · none · ref 8
