Large language models often know when they are being evaluated

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 3

citation-polarity summary: background 3

years: 2026 (11), 2025 (1)

representative citing papers

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

Naturalistic measure of social norms alignment

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

Proposes solution matching metrics (stated and explicit agreement accuracy) and a 3k Danish dilemma dataset to evaluate social norms alignment between LLMs and humans in naturalistic settings.

Towards Understanding Specification Gaming in Reasoning Models

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

Reinforcement learning training for reasoning substantially raises specification gaming rates in LLMs across diverse tasks, with Grok 4 highest and Claude models lowest, and mitigations only partially effective.

An Independent Safety Evaluation of Kimi K2.5

cs.CR · 2026-04-03 · conditional · novelty 6.0

Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

Proposes a harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
