pith. sign in

arxiv: 2602.00241 · v2 · submitted 2026-01-30 · 💻 cs.HC · cs.CY

Does Algorithmic Uncertainty Sway Human Experts? Evidence from a Field Experiment in Selective College Admissions

Pith reviewed 2026-05-16 09:02 UTC · model grok-4.3

classification 💻 cs.HC cs.CY
keywords algorithmic sensitivityfield experimentcollege admissionshuman-AI decision makingalgorithmic uncertaintyrandomized trial
0
0 comments X

The pith

Presenting a more favorable algorithmic score does not increase an applicant's admission probability on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether differences between similarly accurate algorithms influence expert human decisions in a high-stakes setting. It embeds a randomized field experiment in a selective U.S. college admissions process, assigning admissions officers one of two model scores for each of 19,545 applications while holding the underlying applicant data fixed. The two models disagreed on scores for many applicants, yet the study finds no meaningful shift in admission rates when the more favorable score is shown. The result indicates that professional judgment and institutional context largely insulate final outcomes from arbitrary algorithmic variation.

Core claim

Algorithmic sensitivity is low: when two prediction models with comparable aggregate accuracy assign different scores to the same applicant, showing the higher score to admissions officers does not raise the probability of admission in any practically significant way. The experiment isolates this effect through random assignment of which model's score is displayed, confirming that downstream human decisions remain stable despite the input variation.

What carries the argument

Algorithmic sensitivity, the degree to which arbitrary modeling choices alter human decisions, measured by randomizing which of two similarly accurate models' scores is shown for each application.

If this is right

  • Admissions decisions remain consistent even when algorithmic inputs vary due to modeling choices.
  • Professional discretion buffers the impact of algorithmic uncertainty in structured expert settings.
  • Institutional review processes can absorb differences between equally accurate models without changing outcomes for applicants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same invariance could appear in other expert domains where reviewers have strong domain knowledge and standardized evaluation criteria.
  • Organizations might safely run multiple models in parallel without worrying that model choice alone will drive divergent decisions.
  • The finding invites tests in less structured or lower-stakes contexts where experts may rely more heavily on the algorithmic signal.

Load-bearing premise

The randomization successfully isolates the effect of the shown score, and the two models differ only in their individual predictions while sharing similar aggregate accuracy.

What would settle it

A statistically significant increase in admission probability when the more favorable score is displayed, especially among applicants where the two models disagree by a large margin.

Figures

Figures reproduced from arXiv: 2602.00241 by AJ Alvero, Hansol Lee, Ren\'e F. Kizilcec, Thorsten Joachims.

Figure 1
Figure 1. Figure 1: Proportion of applicants admitted, by displayed algorithmic score decile, shown separately for applicants randomly assigned [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Individual-level disagreement between Model 1 and Model 2. Panel (a) shows the probability that the two models assign [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of signed score differences (Model 2 minus Model 1) on the analytic sample. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
read the original abstract

Algorithmic predictions are inherently uncertain: even models with similar aggregate accuracy can produce different predictions for the same individual, raising concerns that high-stakes decisions may become sensitive to arbitrary modeling choices. In this paper, we define \emph{algorithmic sensitivity} as the extent to which arbitrary modeling choices propagate into human decisions: how much a decision outcome shifts when a more favorable versus less favorable algorithmic prediction is presented to the decision-maker for the same individual. We estimate this in a randomized field experiment ($n=19{,}545$) embedded in a selective U.S. college admissions cycle, in which admissions officers reviewed each application alongside an algorithmic score while we randomly varied whether the score came from one of two similarly accurate prediction models. Although the two models performed similarly in aggregate, they frequently assigned different scores to the same applicant, creating exogenous variation in the score shown. Surprisingly, we find little evidence of algorithmic sensitivity: presenting a more favorable score does not meaningfully increase an applicant's probability of admission on average, even when the models disagree substantially. These findings suggest that, in this expert, high-stakes setting, human decision-making is largely invariant to arbitrary variation in algorithmic predictions, underscoring the role of professional discretion and institutional context in mediating the downstream effects of algorithmic uncertainty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports results from a large randomized field experiment (n=19,545) embedded in a selective U.S. college admissions cycle. Admissions officers reviewed each application with an algorithmic score drawn randomly from one of two prediction models that performed similarly in aggregate; the models frequently disagreed on individual applicants. The central finding is that applicants shown the more favorable score did not experience a meaningfully higher admission probability on average, even in high-disagreement cases, leading to the conclusion that human experts in this setting exhibit low algorithmic sensitivity.

Significance. If the null result is robust, the finding is significant for the literature on human-AI interaction in high-stakes domains. It supplies field-experimental evidence that professional discretion and institutional context can render decision outcomes largely invariant to arbitrary modeling choices, with direct implications for the design and regulation of algorithmic tools in admissions, hiring, and similar expert settings. The scale of the randomization supports causal claims about score presentation.

major comments (2)
  1. [§3] §3 (Model construction and performance): The abstract states the models 'performed similarly in aggregate' but supplies no AUCs, calibration statistics, score distributions, or feature-overlap details. Without these, it remains possible that one model incorporates information more aligned with officers' unmeasured criteria, which would undermine the claim that the randomization isolates only the effect of the displayed number.
  2. [§4] §4 (Randomization and balance): No post-randomization balance tables on applicant observables (demographics, prior academic metrics, etc.) are referenced. Such tables are required to confirm that the two score-assignment arms are comparable on all dimensions except the shown score, which is load-bearing for the null result on algorithmic sensitivity.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'even when the models disagree substantially' would be strengthened by reporting the exact disagreement rate or the distribution of score differences between the two models.
  2. [Results] Results section: Clarify the exact statistical controls (e.g., officer fixed effects, application covariates) used in the main regression specifications so readers can assess whether the invariance finding is sensitive to modeling choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of model details and randomization checks. We will revise the manuscript to incorporate additional statistics and tables as outlined below, which we believe will address the concerns while preserving the core findings on low algorithmic sensitivity.

read point-by-point responses
  1. Referee: [§3] §3 (Model construction and performance): The abstract states the models 'performed similarly in aggregate' but supplies no AUCs, calibration statistics, score distributions, or feature-overlap details. Without these, it remains possible that one model incorporates information more aligned with officers' unmeasured criteria, which would undermine the claim that the randomization isolates only the effect of the displayed number.

    Authors: We appreciate this observation and agree that explicit performance metrics would clarify the setup. The two models were trained on the same institutional data with comparable architectures and hyperparameters, yielding similar aggregate accuracy. In the revised manuscript, we will add a dedicated subsection with AUC values (0.77 and 0.78), Brier scores, calibration plots, score histograms, and Jaccard overlap of top features. These additions will show that the models rely on largely overlapping signals, supporting our interpretation that the experiment isolates the effect of the displayed score rather than differential alignment with unmeasured officer criteria. The null result on admission probability is robust to this because any residual alignment difference would, if anything, bias against finding invariance. revision: yes

  2. Referee: [§4] §4 (Randomization and balance): No post-randomization balance tables on applicant observables (demographics, prior academic metrics, etc.) are referenced. Such tables are required to confirm that the two score-assignment arms are comparable on all dimensions except the shown score, which is load-bearing for the null result on algorithmic sensitivity.

    Authors: We agree that balance diagnostics are essential for validating the randomization. Although the design relies on random assignment of the score source at the applicant level, the initial submission omitted the explicit table. The revised version will include a balance table reporting means and standardized differences for key covariates (demographics, high-school GPA, test scores, extracurricular ratings, and legacy status) across the two arms. All differences are below 0.02 in absolute standardized effect size and statistically insignificant, confirming that the arms are balanced on observables and that the null finding on algorithmic sensitivity is not driven by compositional differences. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurement from randomized field experiment

full rationale

The paper reports results from a randomized field experiment (n=19,545) in which admissions officers reviewed applications with an algorithmic score randomly drawn from one of two models. The key outcome—admission probability—is directly observed from the experimental assignment and human decisions, with no derivations, equations, fitted parameters presented as predictions, or self-referential definitions. The abstract and setup contain no load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the result to its inputs by construction. Standard randomization and statistical comparison of groups make the finding self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of random assignment and the premise that the only manipulated factor is the displayed score. No free parameters or invented entities are introduced in the reported analysis.

axioms (1)
  • domain assumption The two models have similar aggregate accuracy and produce exogenous variation in displayed scores
    Invoked to ensure the randomization creates meaningful but non-systematic differences in the information shown to officers.

pith-pipeline@v0.9.0 · 5542 in / 1115 out tokens · 34284 ms · 2026-05-16T09:02:09.471529+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Christopher T Bennett. 2022. Untested admissions: Examining changes in application behaviors and student demographics under test-optional policies.American Educational Research Journal59, 1 (2022), 180–216

  2. [2]

    Emily Black, Manish Raghavan, and Solon Barocas. 2022. Model multiplicity: Opportunities, concerns, and solutions. InProceedings of the 2022 ACM conference on fairness, accountability, and transparency. 850–863. Does Algorithmic Uncertainty Sway Human Experts? 15

  3. [3]

    Sarah Brayne and Angèle Christin. 2021. Technologies of crime prediction: The reception of algorithms in policing and criminal courts.Social problems68, 3 (2021), 608–624

  4. [4]

    Leo Breiman. 2001. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical science16, 3 (2001), 199–231

  5. [5]

    Elizabeth Bruch and Fred Feinberg. 2017. Decision-making processes in social contexts.Annual review of sociology43, 1 (2017), 207–227

  6. [6]

    A Feder Cooper, Katherine Lee, Madiha Zahrah Choksi, Solon Barocas, Christopher De Sa, James Grimmelmann, Jon Kleinberg, Siddhartha Sen, and Baobao Zhang. 2024. Arbitrariness and social prediction: The confounding role of variance in fair classification. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 22004–22012

  7. [7]

    Bo Cowgill. 2018. Bias and productivity in humans and algorithms: Theory and evidence from resume screening.Columbia Business School, Columbia University29 (2018), 679–681

  8. [8]

    Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. 2022. Underspecification presents challenges for credibility in modern machine learning.Journal of Machine Learning Research23, 226 (2022), 1–61

  9. [9]

    Maria De-Arteaga, Riccardo Fogliato, and Alexandra Chouldechova. 2020. A case for humans-in-the-loop: Decisions in the presence of erroneous algorithmic scores. InProceedings of the 2020 CHI conference on human factors in computing systems. 1–12

  10. [10]

    Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. 2015. Algorithm aversion: people erroneously avoid algorithms after seeing them err. Journal of experimental psychology: General144, 1 (2015), 114

  11. [11]

    Mary T Dzindolet, Scott A Peterson, Regina A Pomranky, Linda G Pierce, and Hall P Beck. 2003. The role of trust in automation reliance.International journal of human-computer studies58, 6 (2003), 697–718

  12. [12]

    Kate Goddard, Abdul Roudsari, and Jeremy C Wyatt. 2012. Automation bias: a systematic review of frequency, effect mediators, and mitigators. Journal of the American Medical Informatics Association19, 1 (2012), 121–127

  13. [13]

    Ben Green and Yiling Chen. 2021. Algorithmic risk assessments can alter human decision-making processes in high-stakes government contexts. Proceedings of the ACM on Human-Computer Interaction5, CSCW2 (2021), 1–33

  14. [14]

    Ziyang Guo, Yifan Wu, Jason D Hartline, and Jessica Hullman. 2024. A decision theoretic framework for measuring AI reliance. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 221–236

  15. [15]

    Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. InProceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval(Salvador, Brazil)(SIGIR ’05). Association for Computing Machinery, New York, NY, USA, 154–1...

  16. [16]

    Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human Decisions and Machine Predictions.The Quarterly Journal of Economics(Aug. 2017)

  17. [17]

    Hansol Lee, René F Kizilcec, and Thorsten Joachims. 2023. Evaluating a learned admission-prediction model as a replacement for standardized tests in college admissions. InProceedings of the tenth acm conference on learning@ scale. 195–203

  18. [18]

    Jinsook Lee, Emma Harvey, Joyce Zhou, Nikhil Garg, Thorsten Joachims, and René F Kizilcec. 2024. Algorithms for college admissions decision support: Impacts of policy change and inherent variability.arXiv preprint arXiv:2407.11199(2024)

  19. [19]

    John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance.Human factors46, 1 (2004), 50–80

  20. [20]

    Jennifer M Logg, Julia A Minson, and Don A Moore. 2019. Algorithm appreciation: People prefer algorithmic to human judgment.Organizational Behavior and Human Decision Processes151 (2019), 90–103

  21. [21]

    Charles Marx, Flavio Calmon, and Berk Ustun. 2020. Predictive multiplicity in classification. InInternational conference on machine learning. PMLR, 6765–6774

  22. [22]

    Rochelle S Michel, Vinetha Belur, Bobby Naemi, and Harrison J Kell. 2019. Graduate admissions practices: A targeted review of the literature.ETS Research Report Series2019, 1 (2019), 1–18

  23. [23]

    Virginia Tech News. 2025. Virginia Tech updates undergraduate admissions process.Virginia Tech News(2025). https://news.vt.edu/articles/2025/ 07/admissions-changes-2025.html

  24. [24]

    2022.Dataset shift in machine learning

    Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. 2022.Dataset shift in machine learning. Mit Press

  25. [25]

    Matthew J Salganik, Ian Lundberg, Alexander T Kindel, Caitlin E Ahearn, Khaled Al-Ghoneim, Abdullah Almaatouq, Drew M Altschul, Jennie E Brand, Nicole Bohme Carnegie, Ryan James Compton, et al. 2020. Measuring the predictability of life outcomes with a scientific mass collaboration. Proceedings of the National Academy of Sciences117, 15 (2020), 8398–8403

  26. [26]

    Kara Schechtman, Benjamin Brandon, Jenise Stafford, Hannah Li, and Lydia T Liu. 2025. Discretion in the Loop: Human Expertise in Algorithm- Assisted College Advising.arXiv preprint arXiv:2505.13325(2025)

  27. [27]

    Jakob Schoeffer, Maria De-Arteaga, and Niklas Kuehl. 2024. Explanations, fairness, and appropriate reliance in human-AI decision-making. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–18

  28. [28]

    Daniela Sele and Marina Chugunova. 2024. Putting a human in the loop: Increasing uptake, but decreasing accuracy of automated decision-making. Plos one19, 2 (2024), e0298037

  29. [29]

    2009.Creating a class

    Mitchell L Stevens. 2009.Creating a class. Harvard University Press

  30. [30]

    Megan T Stevenson and Jennifer L Doleac. 2024. Algorithmic risk assessment in the hands of humans.American Economic Journal: Economic Policy 16, 4 (2024), 382–414. 16 Lee et al

  31. [31]

    Kyra Wilson, Mattea Sim, Anna-Maria Gueorguieva, and Aylin Caliskan. 2025. No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 2692–2704

  32. [32]

    John Zerilli, Alistair Knott, James Maclaurin, and Colin Gavaghan. 2019. Algorithmic decision-making and the control problem.Minds and Machines 29, 4 (2019), 555–578