Towards evaluations-based safety cases for

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Evaluating AI Providers' Frontier Safety Frameworks

cs.CY · 2025-12-01 · unverdicted · novelty 6.0

Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · novelty 6.0

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

citing papers explorer

  • Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence · cs.CY · 2026-04-10 · unverdicted · none · ref 1

    An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.

  • Honeypot Protocol · cs.CR · 2026-04-14 · unverdicted · none · ref 1

    The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

  • Frontier Models are Capable of In-context Scheming · cs.AI · 2024-12-06 · conditional · none · ref 6

    Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

  • Evaluating AI Providers' Frontier Safety Frameworks · cs.CY · 2025-12-01 · unverdicted · none · ref 22

    Twelve frontier AI safety frameworks score between 8% and 34% on adapted risk-management criteria, with a median of 18%, leaving them too vague to serve as reliable external accountability mechanisms.

  • Scheming Ability in LLM-to-LLM Strategic Interactions · cs.CL · 2025-10-11 · conditional · none · ref 3

    Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025-07-15 · unverdicted · none · ref 27

    Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.