Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.
C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs, 2025
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: ACCEPT (1)
Representative citing papers
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure