{"paper":{"title":"Fair and Calibrated Toxicity Detection with Robust Training and Abstention","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Toxicity detectors hide calibration unfairness across identity subgroups despite near-perfect overall scores.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Mokshit Surana","submitted_at":"2026-05-13T19:50:35Z","abstract_excerpt":"Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n ="},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration (0.013) but is significantly miscalibrated across all identity subgroups (+0.029 to +0.134). 
Training interventions reshape rather than eliminate disparity, and abstention itself is unfair.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen subgroup definitions, metrics (subgroup AUC, BPSN/BNSP AUC, ECE), and bootstrap CIs fully capture real-world fairness harms and that post-hoc methods can be evaluated independently of training choices.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Training interventions reshape rather than eliminate calibration and abstention disparities in toxicity detection, requiring a multi-axis fairness framework.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Toxicity detectors hide calibration unfairness across identity subgroups despite near-perfect overall scores.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d5dbafe0cf106450938bf58f5b8e6ed37c21bb28e07fb256dd3d2486e4f4ed8e"},"source":{"id":"2605.14074","kind":"arxiv","version":1},"verdict":{"id":"489b6283-612b-4940-97b4-9db6692fba1e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T05:06:16.459603Z","strongest_claim":"Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration (0.013) but is significantly miscalibrated across all identity subgroups (+0.029 to +0.134). 
Training interventions reshape rather than eliminate disparity, and abstention itself is unfair.","one_line_summary":"Training interventions reshape rather than eliminate calibration and abstention disparities in toxicity detection, requiring a multi-axis fairness framework.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen subgroup definitions, metrics (subgroup AUC, BPSN/BNSP AUC, ECE), and bootstrap CIs fully capture real-world fairness harms and that post-hoc methods can be evaluated independently of training choices.","pith_extraction_headline":"Toxicity detectors hide calibration unfairness across identity subgroups despite near-perfect overall scores."},"references":{"count":11,"sample":[{"doi":"","year":2019,"title":"Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. (2019). Nuanced metrics for measuring unintended bias with real data for text classification. WWW Companion","work_id":"3caf62a1-4189-425f-a7df-ea89c0d1365a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. AAAI/ACM AIES","work_id":"55b0a4ff-f708-47f0-83c5-a5d676db6a39","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Geifman, Y., and El-Yaniv, R. (2017). Selective classification for deep neural networks. NeurIPS 2017","work_id":"021ef3f4-3004-4ba0-bbdb-d05a03b7acbc","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). 
On calibration of modern neural networks. ICML 2017","work_id":"c1da11f2-7c41-4a02-9ee0-58c0967b7a7f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Y., Arjovsky, M., Pezeshki, M., and Lopez-Paz, D","work_id":"2430ab6f-f9b0-43ac-a6ce-15eba2f4d7c3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":11,"snapshot_sha256":"599498bc6f961f09f584cce49721295015273607d73c97fee8ca86f11f643391","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"1f8a2ee6f87ae8a710e85312f7b8c0cfe251e854650c344287e394fd91e1ad8a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}