Reward Auditor is a statistical framework that audits reward models for suitability by testing for significant degradation in preference perception confidence distributions under real-world perturbations.
A statistically significant result (i.e., a small p-value) implies that we can reject the null hypothesis that the data follows a normal distribution
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
Reward Auditor is a statistical framework that audits reward models for suitability by testing for significant degradation in preference perception confidence distributions under real-world perturbations.