pith. sign in

Ai control: Improving safety despite intentional subversion

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

citation-role summary

background 2

citation-polarity summary

years

2026 9 2025 1

roles

background 2

polarities

background 2

representative citing papers

Automated alignment is harder than you think

cs.AI · 2026-05-07 · conditional · novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

Detecting Safety Violations Across Many Agent Traces

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

citing papers explorer

Showing 10 of 10 citing papers.