Behavioral canaries detect unauthorized RL fine-tuning on private contexts by inducing and measuring trigger-conditioned stylistic preferences, achieving 67% detection at a 10% false-positive rate with 1% injection.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning