Joe Benton

Identifiers

  • name variant Joe Benton 0.60 · backfill

Papers (8)

  1. Diffuse AI Control on Fuzzy Tasks · cs.LG · 2026 · author #4
  2. Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning · cs.LG · 2026 · author #2
  3. SLEIGHT-Bench: A Benchmark of Evasion Attacks Against Agent Monitors · cs.CR · 2026 · author #5
  4. Efficiently Aligning Language Models with Online Natural Language Feedback · cs.LG · 2026 · author #2
  5. Removing Sandbagging in LLMs by Training with Weak Supervision · cs.LG · 2026 · author #4
  6. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety · cs.AI · 2025 · author #5
  7. Reasoning Models Don't Always Say What They Think · cs.CL · 2025 · author #2
  8. Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming · cs.CL · 2025 · author #13

Mentions

  • 2606.08892 #4 · arxiv_oai · confidence 0.70 Joe Benton
  • 2605.04356 #2 · arxiv_oai · confidence 0.70 Joe Benton
  • 2605.24286 #2 · arxiv_oai · confidence 0.70 Joe Benton
  • 2507.11473 #5 · arxiv_oai · confidence 0.70 Joe Benton
  • 2605.16626 #5 · arxiv_oai · confidence 0.70 Joe Benton
  • 2501.18837 #13 · arxiv_oai · confidence 0.70 Joe Benton

Frequent Coauthors