Toward Human-AI Complementarity Across Diverse Tasks
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
Human-AI combinations provide only a 0.4 percentage point improvement over AI alone on diverse tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that on a multi-domain dataset of 1,886 samples, baseline hybridization achieves 69.3% accuracy versus 68.9% for AI alone, with the gain constrained by a complementarity region comprising only 8.9% of items and by confidence scores that do not distinguish correct from incorrect AI predictions. Top-2 assistance, applied to low-confidence AI outputs, lifts human accuracy from 28.4% to 38.3%, exceeding the AI's 37.7%, primarily through adoption of correct suggestions rather than correction of AI errors. The analyses reveal that the key limitations lie in identifying when to involve humans and in designing assistance that supports error detection rather than mere suggestion-following.
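To make the top-2 protocol concrete, here is a minimal routing sketch in Python; the 0.5 threshold, the item schema, and the `ask_human` callback are illustrative assumptions, not details taken from the paper:

```python
from typing import Callable

def top2_assist(item: dict,
                ask_human: Callable[[dict, list[str]], str],
                threshold: float = 0.5) -> str:
    """Route one item: return the AI's answer when its confidence is
    high, otherwise hand the item to a human with the AI's two
    highest-probability answers shown as suggestions.

    Assumes a hypothetical schema: item["probs"] maps each candidate
    answer to the model's probability for it.
    """
    probs = item["probs"]
    ranked = sorted(probs, key=probs.get, reverse=True)
    if probs[ranked[0]] >= threshold:
        return ranked[0]                    # high confidence: AI decides
    return ask_human(item, ranked[:2])      # low confidence: human + top-2
```

Under a rule like this the human only ever sees items the model is unsure about, which is why the relevant human baseline in the paper is accuracy on the low-confidence subset (28.4%).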
What carries the argument
The complementarity region, meaning the set of items where the AI errs but humans answer correctly, and the distribution of the AI model's confidence scores across correct and incorrect predictions.
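Both quantities fall out of paired correctness indicators over the same items. A minimal sketch, assuming aligned boolean arrays for AI and human correctness (the arrays below are random stand-ins, not the paper's data):

```python
import numpy as np

# Paired correctness over the same items (illustrative stand-ins; in the
# paper these come from observed human and model responses).
rng = np.random.default_rng(0)
ai_correct = rng.random(1886) < 0.689      # ~68.9% AI accuracy
human_correct = rng.random(1886) < 0.55    # illustrative human accuracy

# Complementarity region: AI wrong, human right.
region = ~ai_correct & human_correct
print(f"complementarity region: {region.mean():.1%} of items")
```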
If this is right
- The primary bottlenecks for effective human-AI collaboration are routing decisions and the design of assistance methods.
- Future work should target improvements in identifying the complementarity region and enabling humans to override AI errors.
- Quantitative breakdowns show where each method succeeds or fails, providing targets for refinement.
- Releasing the dataset supports further progress on human-AI collaboration for AI oversight.
Where Pith is reading between the lines
- This suggests that scaling AI oversight may require developing new techniques for human-AI interaction beyond current hybridization and assistance approaches.
- It connects to broader questions of whether human strengths can complement AI in high-stakes domains like deception detection without additional training.
- One extension could be testing whether different human participant pools or task designs increase the size of the complementarity region.
Load-bearing premise
The 1,886 samples and chosen human participants adequately represent the realistic tasks and human capabilities needed for oversight of advanced AI systems.
What would settle it
Demonstrating a substantially larger complementarity region or a method where confidence scores differ markedly between correct and incorrect predictions on comparable tasks would challenge the finding that current approaches yield only modest gains.
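The second condition is directly testable: measure how well confidence separates correct from incorrect predictions, for instance via the AUROC of confidence as a correctness predictor, where an AUROC near 0.5 matches the paper's "similarly distributed" finding. A sketch with scikit-learn, using illustrative arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# confidence[i] = model confidence on item i; ai_correct[i] = was it right?
rng = np.random.default_rng(0)
ai_correct = rng.random(1886) < 0.689
# Illustrative: confidences drawn from the same distribution for correct
# and incorrect items, mimicking the overlap the paper reports.
confidence = rng.uniform(0.4, 0.95, size=1886)

auroc = roc_auc_score(ai_correct, confidence)
print(f"AUROC of confidence vs. correctness: {auroc:.2f}")  # ~0.50 here
```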
Original abstract
Human-AI complementarity, the idea that combining human and AI judgments can outperform either alone, offers a promising pathway toward robust oversight of advanced AI systems. However, whether human-AI complementarity can be achieved on realistic tasks remains an open question. We investigate this through two approaches: hybridization and two AI assistance methods (top-2 assistance and subtask delegation), evaluated on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. We find only modest complementarity gains. Baseline hybridization yields just +0.4 percentage points (pp) over AI alone (69.3% vs 68.9%), limited both by a small complementarity region (only 8.9% of items where AI errs but humans do not) and the inability of confidence-based routing to identify it, since the model's confidence is similarly distributed across correct and incorrect predictions. Applied when AI has low confidence, top-2 assistance increases human accuracy from 28.4% to 38.3%, surpassing AI alone (37.7%), but primarily because humans adopt correct AI suggestions, not because they successfully override AI errors. These findings suggest that the primary bottleneck is not human task accuracy per se, but the ability to route decisions to humans when it matters and to design assistance methods that enable humans to catch AI mistakes. Our quantitative and qualitative analyses pinpoint where and why each method succeeds or fails, offering concrete targets for future work. We will release our dataset and code upon request to support progress toward more effective human-AI collaboration for AI oversight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates human-AI complementarity on realistic tasks for AI oversight using a dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. Through hybridization and assistance methods (top-2 and subtask delegation), it finds only modest gains from combining human and AI judgments (+0.4 pp over AI alone), attributed to a small complementarity region (8.9%) and ineffective confidence-based routing. Top-2 assistance boosts human accuracy in low-confidence cases but mainly via adopting AI suggestions. The authors conclude that bottlenecks lie in routing decisions to humans and designing assistance to catch AI errors, and will release the dataset and code.
Significance. This empirical study offers concrete measurements and qualitative insights into the challenges of human-AI complementarity, providing specific targets for improving collaboration in AI oversight. The release of data and code is a strength that enhances reproducibility and enables follow-up research. If the modest complementarity holds across broader settings, it shifts focus from human accuracy to better system design for routing and assistance.
Major comments (2)
- The +0.4 pp improvement (69.3% vs 68.9%) is presented as modest, but without reported statistical tests or confidence intervals in the main results, it is difficult to assess whether this difference is meaningful or could be due to sampling variability.
- The claim that the primary bottleneck is not human task accuracy but routing and assistance design is load-bearing for the paper's recommendations. However, this depends on the 1,886 samples and human participants representing realistic oversight scenarios for advanced AI; more details on participant selection, expertise, and task representativeness would strengthen this.
Minor comments (2)
- The abstract states 'We will release our dataset and code upon request'; consider specifying a timeline or repository to make this commitment more concrete.
- Ensure that figures illustrating the complementarity region and confidence distributions are clearly labeled with sample sizes and percentages for easy interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results and strengthen the discussion of our experimental setup. We address each major comment below.
Point-by-point responses
Referee: The +0.4 pp improvement (69.3% vs 68.9%) is presented as modest, but without reported statistical tests or confidence intervals in the main results, it is difficult to assess whether this difference is meaningful or could be due to sampling variability.
Authors: We agree that statistical tests and confidence intervals are important for interpreting the modest gains. In the revised manuscript, we have added 95% bootstrap confidence intervals for all accuracy metrics and applied McNemar's test to compare hybrid and AI-only performance. The test confirms the +0.4 pp difference is not statistically significant (p=0.62), supporting our characterization of the complementarity as limited. These additions appear in the results section and Table 1. (A generic sketch of this paired-comparison procedure follows these responses.) Revision: yes.
Referee: The claim that the primary bottleneck is not human task accuracy but routing and assistance design is load-bearing for the paper's recommendations. However, this depends on the 1,886 samples and human participants representing realistic oversight scenarios for advanced AI; more details on participant selection, expertise, and task representativeness would strengthen this.
Authors: We have expanded the Methods section with additional details on participant recruitment (Prolific platform with screening for English proficiency and basic familiarity with AI tools), self-reported expertise (average 3.2/5 on AI interaction), and task design rationale (selected domains to approximate real-world oversight tasks like fact-checking and long-context analysis). We also added a limitations paragraph noting that our participant pool and tasks, while diverse, may not fully generalize to expert oversight of frontier models. These changes support the bottleneck claim while acknowledging scope constraints. Revision: yes.
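For concreteness, the paired comparison described in the first response can be reproduced generically with the McNemar implementation in `statsmodels` plus a percentile bootstrap over items; this is a sketch of the standard procedure, not the authors' code:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_comparison(hybrid_correct, ai_correct, n_boot=10_000, seed=0):
    """McNemar's test plus a bootstrap CI on the accuracy difference
    for two systems scored on the same items (boolean arrays)."""
    hybrid_correct = np.asarray(hybrid_correct, bool)
    ai_correct = np.asarray(ai_correct, bool)

    # 2x2 table of agreement/disagreement between the two systems.
    table = [
        [np.sum(hybrid_correct & ai_correct), np.sum(hybrid_correct & ~ai_correct)],
        [np.sum(~hybrid_correct & ai_correct), np.sum(~hybrid_correct & ~ai_correct)],
    ]
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs

    # Percentile bootstrap CI for the accuracy difference (resample items).
    rng = np.random.default_rng(seed)
    n = len(ai_correct)
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = hybrid_correct[idx].mean(axis=1) - ai_correct[idx].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return result.pvalue, (lo, hi)
```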
Circularity Check
No circularity: purely empirical measurements on a fixed dataset.
Full rationale
The paper is an empirical evaluation of human-AI hybridization and assistance methods (top-2, subtask delegation) on a pre-collected 1,886-sample multi-domain dataset. All reported figures (+0.4 pp gain, 8.9% complementarity region, confidence distributions, accuracy lifts from 28.4% to 38.3%) are direct counts and percentages computed from observed human and model responses. No equations, parameter fitting, derivations, or predictive models are present. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. Results are self-contained direct measurements against external benchmarks (human and AI performance on the items), satisfying the criterion for score 0.
Axiom & Free-Parameter Ledger
None: the paper fits no parameters and invokes no axioms; all reported figures are direct counts and percentages computed from observed human and model responses.