A Unified Approach to Interpreting Model Predictions
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
CelebA encodes gendered double standards of ageing and beauty that produce hyper-scrutiny of women and categorical exclusion of older men across labels, feature weights, and spatial attention.
citing papers explorer
- LLM Output Detectability and Task Performance Can be Jointly Optimized
  PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.
- Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA
  CelebA encodes gendered double standards of ageing and beauty that produce hyper-scrutiny of women and categorical exclusion of older men across labels, feature weights, and spatial attention.