AI control: Improving safety despite intentional subversion

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background (2)

citation-polarity summary

years: 2026 (18) · 2025 (1)

roles: background (2)

polarities: background (2)

representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Automated alignment is harder than you think

cs.AI · 2026-05-07 · conditional · novelty 6.0

AI agents automating alignment research are prone to systematic undetected errors in fuzzy tasks, leading to overconfident but flawed safety assessments even without deliberate sabotage.

Detecting Safety Violations Across Many Agent Traces

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Online Safety Monitoring for LLMs

cs.AI · 2026-07-02 · unverdicted · novelty 3.0

Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • "Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms cs.AI · 2026-06-10 · unverdicted · none · ref 72

    Lie detectors effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges which retain 0.82 balanced accuracy partly due to verification artifacts.

  • From Admission to Invariants: Measuring Deviation in Delegated Agent Systems cs.AI · 2026-04-19 · unverdicted · none · ref 9

    The Non-Identifiability Theorem shows admissible behavior space A0 is not identifiable from local enforcement signals g under the Local Observability Assumption, so the paper introduces an Invariant Measurement Layer to detect admission-time drift.

  • Detecting Safety Violations Across Many Agent Traces cs.AI · 2026-04-13 · unverdicted · none · ref 2

    Meerkat uses clustering plus agentic search to detect sparse safety violations across many agent traces, outperforming baselines and finding nearly 4x more reward-hacking cases on CyBench.

  • Online Safety Monitoring for LLMs cs.AI · 2026-07-02 · unverdicted · none · ref 9

    Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.