"signal": Exactly one of: "attack successful", "query again", or "defender detected"
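As a minimal sketch of how such a constraint could be checked, the snippet below validates a "signal" value against the three strings listed above. It assumes the field arrives inside a JSON object, which the passage does not confirm; the names `ALLOWED_SIGNALS` and `parse_signal` are hypothetical and chosen only for illustration.

```python
import json

# The three allowed values are taken from the quoted passage.
# The JSON wrapping and all names here are assumptions for illustration.
ALLOWED_SIGNALS = ("attack successful", "query again", "defender detected")


def parse_signal(raw: str) -> str:
    """Return the "signal" value from a JSON object string, or raise ValueError."""
    payload = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    signal = payload.get("signal")
    if signal not in ALLOWED_SIGNALS:
        raise ValueError(f"unexpected signal: {signal!r}")
    return signal
```

Rejecting anything outside the three listed strings, rather than normalizing near matches, keeps the check faithful to the "exactly one of" wording.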

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

RL-trained AI double agents using combined ToM and fooling rewards outperform prompted frontier models on a new belief-steering task and show bidirectional emergence between the two skills.

citing papers explorer

Showing 1 of 1 citing paper.

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

cs.CL · 2026-04-13 · unverdicted · none · ref 12
