pith. machine review for the scientific record.

Monitoring reasoning models for misbehavior and the risks of promoting obfuscation

10 Pith papers cite this work. Polarity classification is still indexing.

citation summary

years: 2026 (10)

roles: background (1)

polarities: background (1)

representative citing papers

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
