In The Twelfth International Conference on Learning Representations

Safe RLHF: Safe reinforcement learning from human feedback · 2025 · arXiv 2502.02467

1 Pith paper cites this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

DART raises difference-awareness accuracy from 39% to 68.8% on benchmarks while cutting harm-drift cases by 72.6% and improving real-world appropriate responses from 39.8% to 77.5%.

citing papers explorer


DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training · cs.CL · 2026-04-18 · unverdicted · none · ref 2

