pith. sign in

arxiv: 2605.28255 · v1 · pith:YAKFXPS5new · submitted 2026-05-27 · 💻 cs.AI · cs.CL· cs.HC

AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Pith reviewed 2026-06-29 12:11 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.HC
keywords human-AI collaborationdelegationadoptiontrustquestion answeringconfirmation biasreliance decisionsover-reliance
0
0 comments X

The pith

Humans in a question-answering game with AI outperform either alone but still miss correct AI suggestions and accept wrong ones due to bias and uncalibrated confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how people decide to delegate tasks to AI without seeing its output and how they decide to adopt or reject AI suggestions after seeing them. These two reliance patterns are examined together in the same users during a competitive question-answering game that pairs expert humans with multiple AI agents. Collaboration improves overall accuracy compared with solo humans or solo AI, yet humans still miss 3.9 percent of opportunities to use correct AI answers and follow misleading AI answers 1.7 percent of the time. Confirmation bias raises under-reliance to 64.5 percent when an AI suggestion matches a human's initial wrong answer, and reported AI confidence is near chance level on disagreements. The findings point to concrete design changes such as better calibrated confidence scores and evidence-grounded explanations.

Core claim

In 24 matches that produced 387 delegation decisions and 1440 adoption decisions, human-AI teams performed better than either humans or AI working alone, yet humans under-relied on correct AI suggestions in 3.9 percent of opportunities and over-relied on misleading AI suggestions in 1.7 percent of cases. When humans and AI disagreed, reported model confidence performed near chance. Confirmation bias produced markedly higher under-reliance (64.5 percent) precisely when an AI suggestion agreed with a human's initial incorrect answer. Both parties therefore contributed errors, and the study recommends calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine t

What carries the argument

The separation of delegation choice (deciding to let AI act autonomously without seeing its output) from adoption choice (evaluating a visible AI suggestion), measured together for the same users inside a competitive question-answering game.

If this is right

  • Human-AI teams reach higher accuracy than solo humans or solo AI in the same question-answering task.
  • Suboptimal reliance appears in both directions: missed correct AI suggestions and accepted incorrect ones.
  • Confirmation bias raises under-reliance sharply when AI output matches an initial human error.
  • AI confidence scores are near chance level precisely when humans and AI disagree.
  • Calibrated confidence, evidence-based explanations, and trust-refinement tools are proposed as direct remedies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the observed rates persist outside games, user training focused on detecting confirmation bias could measurably improve reliance accuracy.
  • Interfaces that surface disagreement more visibly than agreement might reduce the 64.5 percent under-reliance spike.
  • The same delegation-versus-adoption split could be measured in domains such as medical diagnosis or legal review to check whether the same bias patterns appear.
  • Allowing users to adjust an AI's displayed confidence threshold over repeated interactions offers a testable way to close the performance gap.

Load-bearing premise

The competitive game format with expert humans and the chosen AI agents produces reliance decisions that generalize to other human-AI collaboration settings.

What would settle it

Repeating the same measures of under-reliance and over-reliance in a non-competitive, open-ended question-answering task with non-expert users would directly test whether the reported rates and bias patterns hold outside the game setting.

Figures

Figures reproduced from arXiv: 2605.28255 by Eve Fleisig, Irene Ying, Jordan Boyd-Graber, Maharshi Gor, Tianyi Zhou, Yoo Yeon Sung, Yu Hou.

Figure 1
Figure 1. Figure 1: Our experimental setup has humans working [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our collaboration interface show [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Question difficulty reveals systematic com [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Utilization and discernment rates by round and [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Every bonus decision traced from left to [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Explanation features differ in what predicts [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Decision quality analysis by AI agreement status. Top: overall flow from human initial state through decision to outcome. Middle: when AIs agree (consensus), teams switch more readily and achieve higher accuracy. Bottom: when AIs disagree, teams face harder choices and under-reliance increases. Flow widths represent proportion of responses; colors indicate outcome quality. Compare to the aggregate view in … view at source ↗
Figure 9
Figure 9. Figure 9: Relative difficulty distribution across bonus [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Question difficulty by packet for humans [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Full feature-wise prediction accuracy. Left: predicting [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
read the original abstract

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice -- deciding when to let AI act autonomously without knowing its output, and the adoption choice -- evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human--AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human--AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports an observational study of human-AI collaboration in a competitive question-answering game. Across 24 matches pairing 23 expert humans with 16 AI agents, 387 delegation decisions and 1440 adoption decisions were recorded. The central claims are that human-AI teams outperform either humans or AI alone, yet humans exhibit suboptimal reliance patterns: under-reliance on correct AI suggestions in 3.9% of opportunities and over-reliance on incorrect AI suggestions in 1.7% of cases. Additional findings include model confidence near chance level on disagreements and confirmation bias producing 64.5% under-reliance when AI output matches an initial human error. The authors recommend calibrated confidence, evidence-grounded explanations, and trust-refinement mechanisms.

Significance. If the reported reliance rates and bias patterns hold, the work supplies concrete empirical data on when and why humans delegate or adopt AI output in a high-stakes collaborative setting. The design that measures both delegation and adoption decisions from the same participants is a clear strength, as is the scale of observed decisions. The findings could inform interface design for better-calibrated human-AI teams. However, the competitive expert QA context limits immediate generalizability, and the absence of statistical tests on the key percentages weakens the evidential basis for labeling the behavior 'suboptimal.'

major comments (2)
  1. [Abstract] Abstract: the central claim that humans 'make suboptimal collaboration decisions' is supported only by the raw percentages 3.9% and 1.7%; no statistical tests, confidence intervals, or comparison to a normative baseline (e.g., expected error under random or optimal policy) are reported, leaving open whether these rates differ reliably from chance or from optimal play within the game.
  2. [Abstract] Abstract and Discussion: the inference that the observed under- and over-reliance rates demonstrate generally suboptimal human decision-making is load-bearing for the paper's contribution, yet the design is confined to a competitive scoring game with expert participants and 16 specific AI agents; no within-paper comparison or robustness check tests whether the same direction or magnitude of bias appears in non-competitive tasks, with non-experts, or on different question distributions.
minor comments (2)
  1. [Abstract] The abstract states that 'reported model confidence is near chance when humans and AI disagree' but does not indicate how confidence was elicited from the AI agents or whether it was normalized across the 16 models.
  2. [Methods] Methods section (inferred from sample sizes): exclusion criteria, inter-rater reliability for decision coding, and any controls for order or fatigue effects across the 24 matches are not mentioned in the provided abstract, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evidential basis for our claims and the scope of generalizability. We address each major comment below and outline planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that humans 'make suboptimal collaboration decisions' is supported only by the raw percentages 3.9% and 1.7%; no statistical tests, confidence intervals, or comparison to a normative baseline (e.g., expected error under random or optimal policy) are reported, leaving open whether these rates differ reliably from chance or from optimal play within the game.

    Authors: We agree that statistical support would strengthen the central claim. In the revised manuscript we will add binomial confidence intervals around the reported 3.9% under-reliance and 1.7% over-reliance rates. We will also include explicit comparisons of these rates against (a) a random policy baseline derived from the observed human and AI accuracies and (b) an optimal policy that always follows the higher-accuracy agent on each question. These additions will clarify whether the observed deviations are statistically distinguishable from chance or optimal behavior within the game. revision: yes

  2. Referee: [Abstract] Abstract and Discussion: the inference that the observed under- and over-reliance rates demonstrate generally suboptimal human decision-making is load-bearing for the paper's contribution, yet the design is confined to a competitive scoring game with expert participants and 16 specific AI agents; no within-paper comparison or robustness check tests whether the same direction or magnitude of bias appears in non-competitive tasks, with non-experts, or on different question distributions.

    Authors: We acknowledge that the study is limited to a competitive expert QA setting with the 16 AI agents used. The manuscript does not assert that the observed bias magnitudes are universal. In revision we will qualify the abstract and discussion to state explicitly that the findings are tied to this high-stakes collaborative game and to recommend future work examining non-competitive tasks, non-expert participants, and varied question distributions. Because the current work is an observational study of existing matches, we cannot add new within-paper robustness experiments without collecting additional data. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational behavioral study with direct counts

full rationale

The paper is an empirical study reporting delegation and adoption decisions from 24 matches (387 delegation, 1440 adoption decisions). All key figures (3.9% under-reliance, 1.7% over-reliance, 64.5% confirmation bias) are direct tallies or percentages from recorded human choices against AI outputs. No equations, derivations, fitted parameters, or self-citation chains appear in the load-bearing claims. The central results are falsifiable observations from the specific game setup; generalizability is presented as an open question rather than derived. This matches the default non-circular case for observational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical behavioral study; the central claim rests on standard experimental assumptions in HCI rather than mathematical derivations. No free parameters, new entities, or ad-hoc axioms are introduced beyond the domain assumption that game decisions reflect genuine trust patterns.

pith-pipeline@v0.9.1-grok · 5813 in / 1298 out tokens · 43697 ms · 2026-06-29T12:11:55.058822+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 1 canonical work pages

  1. [1]

    InEmpirical Methods in Natural Language Processing

    Dataset and baselines for sequential open- domain question answering. InEmpirical Methods in Natural Language Processing. Stephen M. Fleming and Hakwan C. Lau. 2014. How to measure metacognition.Frontiers in Human Neu- roscience, 8. Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar,...

  2. [2]

    how do i fool you?

    Automation bias: a systematic review of fre- quency, effect mediators, and mitigators.Journal of the American Medical Informatics Association, 19(1):121–127. Catalina Gomez, Sue Min Cho, Shichang Ke, Chien- Ming Huang, and Mathias Unberath. 2025. Human- AI collaboration is not very collaborative yet: A tax- onomy of interaction patterns in AI-assisted dec...

  3. [3]

    why should i trust you?

    The impact of large language models in fi- nance: Towards trustworthy adoption.The Alan Tur- ing Institute. Laura R. Marusich, Jonathan Z. Bakdash, Yan Zhou, and Murat Kantarcioglu. 2024. Using ai uncertainty quantification to improve human decision-making. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org. Raja Par...

  4. [4]

    The clue mentions X

    Effect of confidence and explanation on accu- racy and trust calibration in ai-assisted decision mak- ing. InProceedings of the 2020 conference on fair- ness, accountability, and transparency, pages 295– 305. A Additional Related Work This section outlines additional related work not featured in the main text of the paper. A.1 Human-AI Collaboration and A...

  5. [5]

    One shared signal:Only evidential_grounding appears in both top-10 lists, suggesting humans recognize quality when reasoning explicitly cites evidence

  6. [6]

    LLM features predict correctness: question_comprehension (76%), evidential_grounding (74%), domain_expertise_display (74%), and reasoning_coherence (72%) strongly predict whether theAIis correct

  7. [7]

    Surface features predict hu- man trust: has_quotes (70%), semantic_similarity (66%), and word_overlap_ratio (63%) predict human selection but are weak correctness signals

  8. [8]

    quizbowlAIagent

    Implication:AIexplanations should make reasoning explicit through evidence citation. Humans should evaluate reasoning quality rather than surface familiarity. F.5 Implementation Details Features are extracted using a modular pipeline with seven extractors: • SurfaceLinguisticExtractor: Uses spaCy for tokenization and POS tagging; textstat for readability ...

  9. [9]

    All personal and identifiable information has been explicitly anonymized before release

    Dataset:All tournament questions (tossup and bonus), human responses at each stage (initial guess, final answer),AIresponses (answers, confidence scores, explanations), muting decisions, answer correctness la- bels, and anonymized player metadata (experience level, team assignment). All personal and identifiable information has been explicitly anonymized ...

  10. [10]

    Available at https://github.com/qanta-org/ qb-tournament-runner

    Tournament application:Source code for the web application, including the buzzer integration, bonus collaboration interface, moderator dashboard, and draft management system. Available at https://github.com/qanta-org/ qb-tournament-runner

  11. [11]

    Available at https://github.com/qanta-org/ qanta25-analysis

    Analysis scripts:All scripts used for feature extraction, statistical testing, figure generation, and the logistic regression analyses reported in this paper. Available at https://github.com/qanta-org/ qanta25-analysis. System Models Used Calls Approach System 1 GPT-4o-mini (answer) 1 Single-shotlightweight model with strict JSON-only output for- mat. Sys...