Confronting Label Indeterminacy in Automated Bail Decisions
Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3
The pith
How unknown outcomes for denied bail cases are handled changes machine learning models' predictions, sometimes more than the choice of model itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Pennsylvania bail dataset, each of the five approaches to label indeterminacy rests on unverifiable assumptions, yet each produces distinct changes in model predictions and internal reasoning across three machine learning models. The novel imputation method assigns outcomes to denied cases using assumptions motivated by the dynamics of bail decisions. Explainable AI techniques show that these label choices affect which features drive the models' outputs. The work further assesses the legal legitimacy of relying on such methods when building automated support for bail decisions.
What carries the argument
Label indeterminacy from unobserved counterfactual outcomes in denied bail cases, resolved through five handling methods including a dynamics-based imputation technique.
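To make the idea of label indeterminacy concrete, here is a minimal sketch on toy data. The three strategies shown (dropping denied cases, or imputing a constant label for them) are illustrative assumptions only; they are not claimed to be the five methods evaluated in the paper, and the `Case` record and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Case(NamedTuple):
    bail_granted: bool
    appeared: Optional[bool]  # None when bail was denied: the outcome was never observed


def drop_denied(cases):
    """Keep only released defendants (assumes denied cases can be ignored)."""
    return [(c, c.appeared) for c in cases if c.bail_granted]


def impute_constant(cases, value):
    """Give every denied case the same label `value` (a strong, unverifiable assumption)."""
    return [(c, c.appeared if c.bail_granted else value) for c in cases]


def appearance_rate(labelled):
    """Share of cases labelled as 'appeared in court'."""
    labels = [y for _, y in labelled]
    return sum(labels) / len(labels)


# Toy records (hypothetical): three released defendants, two denied bail.
cases = [
    Case(True, True), Case(True, True), Case(True, False),
    Case(False, None), Case(False, None),
]

rates = {
    "drop_denied": appearance_rate(drop_denied(cases)),
    "denied_would_fail": appearance_rate(impute_constant(cases, False)),
    "denied_would_appear": appearance_rate(impute_constant(cases, True)),
}
```

Even on five records, the training base rate ranges from 0.4 to 0.8 depending on the assumption, which is the kind of shift a downstream model would then learn from.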
If this is right
- Models will assign different risk scores to the same defendants depending on which label method is used.
- Explainable AI outputs will highlight different factors as important under different label assumptions.
- The strength of influence from label methods can exceed the differences between the three machine learning models.
- Automated systems may create or reinforce different feedback loops in bail decisions based on the chosen method.
- Legal evaluation is required to determine which unverifiable assumptions are acceptable for judicial support tools.
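One way to check whether label-method effects exceed model effects is to compare score gaps when only one factor changes at a time. The sketch below uses made-up risk scores for a single defendant; the method names, model names, and numbers are all hypothetical, and only two models are shown for brevity.

```python
from itertools import combinations
from statistics import mean

# Hypothetical predicted risk for one defendant; keys are (label_method, model).
scores = {
    ("drop", "logreg"): 0.30, ("drop", "forest"): 0.34,
    ("impute_fail", "logreg"): 0.55, ("impute_fail", "forest"): 0.58,
    ("impute_appear", "logreg"): 0.20, ("impute_appear", "forest"): 0.25,
}


def mean_gap(scores, vary):
    """Mean absolute score gap over pairs that differ in exactly one factor.

    vary=0 varies the label method with the model held fixed;
    vary=1 varies the model with the label method held fixed.
    """
    fixed = 1 - vary
    gaps = [
        abs(scores[a] - scores[b])
        for a, b in combinations(scores, 2)
        if a[fixed] == b[fixed] and a[vary] != b[vary]
    ]
    return mean(gaps)


label_gap = mean_gap(scores, vary=0)  # effect of changing the label method
model_gap = mean_gap(scores, vary=1)  # effect of changing the model
```

In this toy table the label-method gap is larger than the model gap; whether that holds on real data is exactly what the paper's comparison has to establish.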
Where Pith is reading between the lines
- Similar label indeterminacy is likely to appear in other selective labeling settings such as parole or child welfare risk tools.
- Routine sensitivity testing across label methods should become part of validation for any model trained on decision-generated data.
- Documentation of the chosen label handling approach may be needed for any public deployment of such decision support systems.
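A routine sensitivity test could be as simple as flagging the cases whose predicted risk moves substantially across label methods. The following is a sketch under assumed inputs: the prediction lists, method names, and the 0.2 threshold are hypothetical choices, not values from the paper.

```python
def flag_unstable(preds_by_method, threshold=0.2):
    """Return indices of cases whose predicted risk range across label methods exceeds `threshold`."""
    methods = list(preds_by_method)
    n_cases = len(preds_by_method[methods[0]])
    flagged = []
    for i in range(n_cases):
        values = [preds_by_method[m][i] for m in methods]
        if max(values) - min(values) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical predictions for the same three defendants under three label methods.
preds_by_method = {
    "drop": [0.10, 0.40, 0.70],
    "impute_fail": [0.15, 0.75, 0.72],
    "impute_appear": [0.12, 0.35, 0.69],
}
unstable = flag_unstable(preds_by_method)
```

Here only the second defendant (index 1) is flagged, since that score spans 0.35 to 0.75 depending on the label assumption.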
Load-bearing premise
That the five label handling methods can be isolated and directly compared in the Pennsylvania dataset without other data characteristics or modeling choices confounding their effects on predictions and internal processes.
What would settle it
If the three models, re-run on the same Pennsylvania data under each of the five label methods, produced identical predictions and identical XAI explanations, the claim that the methods influence model behavior would be falsified.
Original abstract
Bail decisions present a fundamental challenge for data-driven decision support systems. When bail is denied, the counterfactual outcome of whether the defendant would have appeared in court remains unobserved. As a result, historical bail data embed structural label indeterminacy: future decisions are influenced by past decisions whose outcomes are only partially knowable. Building automated systems on such data risks introducing bias and reinforcing feedback loops. This raises a core question for machine-learning systems intended to assist judicial actors: how should cases in which bail was denied be treated during model development? In a case study of bail decisions from the Unified Judicial System of Pennsylvania, we evaluate five contemporary approaches to handling label indeterminacy across three machine learning models, including a novel label imputation method motivated by the dynamics of bail decisions. Each method relies on unverifiable assumptions, yet all influence the models' predictive behaviour, sometimes even more so than the choice of model itself. Explainable AI analysis further reveals that these effects extend to the models' internal decision-making processes as well. Finally, we consider the notion of label indeterminacy from a legal perspective and assess the legitimacy of these approaches in the context of bail decision-making.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript addresses label indeterminacy in bail decision datasets, where outcomes for denied bail cases remain unobserved counterfactuals. In a case study using Pennsylvania Unified Judicial System records, it evaluates five approaches to handling this indeterminacy, including a novel imputation method motivated by bail decision dynamics, across three machine learning models. The central claim is that each method rests on unverifiable assumptions yet substantially influences predictive behavior and XAI-derived internal decision processes, sometimes more than the choice of model itself; the work concludes with a legal analysis of the legitimacy of these approaches for judicial decision support.
Significance. If the empirical isolation of label-handling effects holds, the work is significant for ML applications in high-stakes legal domains. It provides a concrete demonstration of how structural missingness can propagate into models and explanations, introduces a domain-motivated imputation technique, and integrates legal perspectives on feedback loops and legitimacy. These elements could inform more robust practices for deploying predictive systems in bail and similar settings where partial observability is inherent.
Major comments (2)
- [Case Study / Empirical Evaluation] The central claim that label-handling method influences predictive behavior and XAI processes more than model choice requires that differences be attributable to the handling of unobserved counterfactuals rather than correlated factors. The experimental setup does not report an ablation holding feature pipelines, cross-validation folds, and hyperparameter tuning budgets fixed across the five methods, leaving the dominance result vulnerable to confounding by preprocessing or split strategies in the Pennsylvania data.
- [Results and Discussion] The soundness assessment notes the absence of quantitative results, error bars, data-split details, model specifications, or statistical tests in the reported evaluation. Without these, it is difficult to assess the magnitude and reliability of the reported influences on predictions and explanations.
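The ablation requested in the first major comment amounts to generating the cross-validation folds once and reusing them for every label method and model. The sketch below shows only that bookkeeping; the function names, method names, and model names are hypothetical, and actual training is left out.

```python
import random


def make_folds(n_cases, k, seed=0):
    """Assign each case index to one of k folds, once, using a fixed seed."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def run_ablation(n_cases, label_methods, models, k=5, seed=0):
    """List every (method, model, fold) run, all sharing the SAME folds."""
    folds = make_folds(n_cases, k, seed)
    runs = []
    for method in label_methods:
        for model in models:
            for fold_id, test_idx in enumerate(folds):
                # A real pipeline would train and evaluate here; this sketch
                # only records which test indices each run would use.
                runs.append((method, model, fold_id, tuple(test_idx)))
    return runs


runs = run_ablation(20, ["drop", "impute_fail"], ["logreg", "forest"], k=4)
```

Because the folds are built before the loops, any difference between two runs with the same `fold_id` cannot come from the split. The same principle would apply to feature pipelines and tuning budgets.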
Minor comments (3)
- [Methods] Clarify the exact definitions and implementation details of the five contemporary approaches (including the novel imputation) to allow replication and direct comparison.
- [Abstract and Methods] The abstract states that all methods rely on unverifiable assumptions; the main text should explicitly list these assumptions for each method alongside any sensitivity analyses performed.
- [Explainable AI Analysis] Ensure that XAI analysis sections report consistent metrics across models and methods so that claims about effects on internal decision processes can be directly compared.
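One candidate for a consistent XAI comparison metric is the overlap between the top-ranked features under two label methods. The sketch below uses Jaccard overlap of the top-k features; the feature names and importance values are hypothetical, and the paper may use a different metric.

```python
def top_k_features(importance, k):
    """Return the k features with the largest absolute importance."""
    ranked = sorted(importance, key=lambda f: abs(importance[f]), reverse=True)
    return set(ranked[:k])


def top_k_overlap(importance_a, importance_b, k=3):
    """Jaccard overlap of the top-k features under two label methods."""
    a = top_k_features(importance_a, k)
    b = top_k_features(importance_b, k)
    return len(a & b) / len(a | b)


# Hypothetical feature importances for the same model under two label methods.
imp_drop = {"prior_fta": 0.40, "age": 0.25, "charge_grade": 0.20, "employment": 0.10}
imp_impute = {"prior_fta": 0.30, "age": 0.05, "charge_grade": 0.35, "employment": 0.28}

overlap = top_k_overlap(imp_drop, imp_impute, k=3)
```

An overlap of 1.0 would mean both label methods lead the model to rely on the same top features; lower values indicate that the explanation itself depends on the label assumption.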
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We agree that strengthening the empirical evaluation and providing more quantitative details will improve the clarity and robustness of our findings. Below, we respond to each major comment and indicate the revisions we will make.
Point-by-point responses
- Referee: [Case Study / Empirical Evaluation] The central claim that label-handling method influences predictive behavior and XAI processes more than model choice requires that differences be attributable to the handling of unobserved counterfactuals rather than correlated factors. The experimental setup does not report an ablation holding feature pipelines, cross-validation folds, and hyperparameter tuning budgets fixed across the five methods, leaving the dominance result vulnerable to confounding by preprocessing or split strategies in the Pennsylvania data.
  Authors: We agree that the experimental setup should explicitly isolate the effects of label-handling methods through an ablation study with fixed feature pipelines, cross-validation folds, and hyperparameter tuning. We will add this ablation to the manuscript, with results demonstrating the dominance of label-handling effects, to eliminate potential confounding.
  Revision: yes
- Referee: [Results and Discussion] The soundness assessment notes the absence of quantitative results, error bars, data-split details, model specifications, or statistical tests in the reported evaluation. Without these, it is difficult to assess the magnitude and reliability of the reported influences on predictions and explanations.
  Authors: We acknowledge the validity of this observation. The current manuscript emphasizes qualitative comparisons and the legal implications, but to enhance the soundness of the empirical claims, we will revise the Results and Discussion sections to include detailed quantitative metrics (e.g., AUC, F1 scores), error bars from multiple runs or cross-validation, explicit data-split information, full model specifications and hyperparameters, and statistical tests (such as ANOVA or paired tests) to evaluate the significance of differences between label-handling methods versus model choices.
  Revision: yes
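As an illustration of the kind of paired test mentioned in the response, the sketch below runs an exact sign-flip permutation test on per-fold AUC differences between two label methods. This particular test is a stand-in chosen for simplicity, not the authors' stated procedure, and the AUC values are made up.

```python
from itertools import product


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


# Hypothetical per-fold AUCs for two label methods on the same five folds.
auc_method_a = [0.71, 0.73, 0.70, 0.74, 0.72]
auc_method_b = [0.66, 0.69, 0.65, 0.70, 0.68]
p_value = paired_sign_flip_pvalue(auc_method_a, auc_method_b)
```

With only five paired folds the smallest attainable two-sided p-value is 2/32 = 0.0625, so more folds or repeated runs would be needed before such a test could reach conventional significance thresholds. Cross-validation folds also share training data, so the independence assumption behind the test holds only approximately.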
Circularity Check
No circularity: empirical case study on external data
The paper is an empirical case study evaluating five methods for label indeterminacy on Pennsylvania bail records. The central observations (method effects on predictive behavior and XAI processes) are obtained by training models on the dataset and comparing outputs; no derivation chain reduces any result to its inputs by construction, no parameters are fitted then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The analysis remains self-contained against the external court data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The five approaches to label indeterminacy rest on unverifiable assumptions about the missing counterfactual outcomes.