Buck Shlegeris

Identifiers

  • name variant Buck Shlegeris 0.60 · backfill

Papers (9)

  1. Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases cs.AI · 2026 · author #3
  2. LinuxArena: A Control Setting for AI Agents in Live Production Software Environments cs.CR · 2026 · author #33
  3. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety cs.AI · 2025 · author #34
  4. Alignment faking in large language models cs.AI · 2024 · author #18
  5. Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols cs.AI · 2024 · author #3
  6. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024 · author #11
  7. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training cs.CR · 2024 · author #37
  8. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small cs.LG · 2022 · author #4
  9. Supervising strong learners by amplifying weak experts cs.LG · 2018 · author #2

Mentions

  • 2507.11473 #34 · arxiv_oai · confidence 0.70 Buck Shlegeris
  • 2406.10162 #11 · arxiv_oai · confidence 0.70 Buck Shlegeris

Frequent Coauthors