In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11

Safe RLHF: safe reinforcement learning from human feedback · 2024 · arXiv 2502.05163

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

ML-Bench is a multilingual safety benchmark derived from actual regional laws and regulations, paired with ML-Guard guardrail models that outperform 11 baselines on existing and new benchmarks.

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.

citing papers explorer

Showing 2 of 2 citing papers.

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

cs.CL · 2026-05-01 · unverdicted · none · ref 5

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · none · ref 6
