Toward Human-AI Complementarity Across Diverse Tasks
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
Human-AI combinations provide only a 0.4 percentage point improvement over AI alone on diverse tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that on a multi-domain dataset of 1,886 samples, baseline hybridization achieves 69.3% accuracy versus 68.9% for AI alone, with the gain constrained by a complementarity region comprising only 8.9% of items and by confidence scores that do not distinguish correct from incorrect AI predictions. Top-2 assistance, applied to low-confidence AI outputs, lifts human accuracy from 28.4% to 38.3%, exceeding the AI's 37.7%, primarily through adoption of correct suggestions rather than correction of AI errors. The analyses reveal that the key limitations lie in identifying when to involve humans and in designing assistance that supports error detection rather than mere suggestion-following.
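To make the top-2 protocol concrete, here is a minimal routing sketch in Python; the 0.5 threshold, the item schema, and the `ask_human` callback are illustrative assumptions, not details taken from the paper:

```python
from typing import Callable

def top2_assist(item: dict,
                ask_human: Callable[[dict, list[str]], str],
                threshold: float = 0.5) -> str:
    """Route one item: return the AI's answer when its confidence is
    high, otherwise hand the item to a human with the AI's two
    highest-probability answers shown as suggestions.

    Assumes a hypothetical schema: item["probs"] maps each candidate
    answer to the model's probability for it.
    """
    probs = item["probs"]
    ranked = sorted(probs, key=probs.get, reverse=True)
    if probs[ranked[0]] >= threshold:
        return ranked[0]                    # high confidence: AI decides
    return ask_human(item, ranked[:2])      # low confidence: human + top-2
```

Under a rule like this the human only ever sees items the model is unsure about, which is why the relevant human baseline in the paper is accuracy on the low-confidence subset (28.4%).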
What carries the argument
The complementarity region, meaning the set of items where the AI errs but humans answer correctly, and the distribution of the AI model's confidence scores across correct and incorrect predictions.
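Both quantities fall out of paired correctness indicators over the same items. A minimal sketch, assuming aligned boolean arrays for AI and human correctness (the arrays below are random stand-ins, not the paper's data):

```python
import numpy as np

# Paired correctness over the same items (illustrative stand-ins; in the
# paper these come from observed human and model responses).
rng = np.random.default_rng(0)
ai_correct = rng.random(1886) < 0.689      # ~68.9% AI accuracy
human_correct = rng.random(1886) < 0.55    # illustrative human accuracy

# Complementarity region: AI wrong, human right.
region = ~ai_correct & human_correct
print(f"complementarity region: {region.mean():.1%} of items")
```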
If this is right
- The primary bottlenecks for effective human-AI collaboration are routing decisions and the design of assistance methods.
- Future work should target improvements in identifying the complementarity region and enabling humans to override AI errors.
- Quantitative breakdowns show where each method succeeds or fails, providing targets for refinement.
- Releasing the dataset supports further progress on human-AI collaboration for AI oversight.
Where Pith is reading between the lines
- This suggests that scaling AI oversight may require developing new techniques for human-AI interaction beyond current hybridization and assistance approaches.
- It connects to broader questions of whether human strengths can complement AI in high-stakes domains like deception detection without additional training.
- One extension could be testing whether different human participant pools or task designs increase the size of the complementarity region.
Load-bearing premise
The 1,886 samples and chosen human participants adequately represent the realistic tasks and human capabilities needed for oversight of advanced AI systems.
What would settle it
Demonstrating a substantially larger complementarity region or a method where confidence scores differ markedly between correct and incorrect predictions on comparable tasks would challenge the finding that current approaches yield only modest gains.
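The second condition is directly testable: measure how well confidence separates correct from incorrect predictions, for instance via the AUROC of confidence as a correctness predictor, where an AUROC near 0.5 matches the paper's "similarly distributed" finding. A sketch with scikit-learn, using illustrative arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# confidence[i] = model confidence on item i; ai_correct[i] = was it right?
rng = np.random.default_rng(0)
ai_correct = rng.random(1886) < 0.689
# Illustrative: confidences drawn from the same distribution for correct
# and incorrect items, mimicking the overlap the paper reports.
confidence = rng.uniform(0.4, 0.95, size=1886)

auroc = roc_auc_score(ai_correct, confidence)
print(f"AUROC of confidence vs. correctness: {auroc:.2f}")  # ~0.50 here
```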
Original abstract
Human-AI complementarity, the idea that combining human and AI judgments can outperform either alone, offers a promising pathway toward robust oversight of advanced AI systems. However, whether human-AI complementarity can be achieved on realistic tasks remains an open question. We investigate this through two approaches: hybridization and two AI assistance methods (top-2 assistance and subtask delegation), evaluated on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. We find only modest complementarity gains. Baseline hybridization yields just +0.4 percentage points (pp) over AI alone (69.3% vs 68.9%), limited both by a small complementarity region (only 8.9% of items where AI errs but humans do not) and the inability of confidence-based routing to identify it, since the model's confidence is similarly distributed across correct and incorrect predictions. Applied when AI has low confidence, top-2 assistance increases human accuracy from 28.4% to 38.3%, surpassing AI alone (37.7%), but primarily because humans adopt correct AI suggestions, not because they successfully override AI errors. These findings suggest that the primary bottleneck is not human task accuracy per se, but the ability to route decisions to humans when it matters and to design assistance methods that enable humans to catch AI mistakes. Our quantitative and qualitative analyses pinpoint where and why each method succeeds or fails, offering concrete targets for future work. We will release our dataset and code upon request to support progress toward more effective human-AI collaboration for AI oversight.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates human-AI complementarity on realistic tasks for AI oversight using a dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. Through hybridization and assistance methods (top-2 and subtask delegation), it finds only modest gains from combining human and AI judgments (+0.4 pp over AI alone), attributed to a small complementarity region (8.9%) and ineffective confidence-based routing. Top-2 assistance boosts human accuracy in low-confidence cases but mainly via adopting AI suggestions. The authors conclude that bottlenecks lie in routing decisions to humans and designing assistance to catch AI errors, and will release the dataset and code.
Significance. This empirical study offers concrete measurements and qualitative insights into the challenges of human-AI complementarity, providing specific targets for improving collaboration in AI oversight. The release of data and code is a strength that enhances reproducibility and enables follow-up research. If the modest complementarity holds across broader settings, it shifts focus from human accuracy to better system design for routing and assistance.
Major comments (2)
- The +0.4 pp improvement (69.3% vs 68.9%) is presented as modest, but without reported statistical tests or confidence intervals in the main results, it is difficult to assess whether this difference is meaningful or could be due to sampling variability.
- The claim that the primary bottleneck is not human task accuracy but routing and assistance design is load-bearing for the paper's recommendations. However, this depends on the 1,886 samples and human participants representing realistic oversight scenarios for advanced AI; more details on participant selection, expertise, and task representativeness would strengthen this.
Minor comments (2)
- The abstract states 'We will release our dataset and code upon request'; consider specifying a timeline or repository to make this commitment more concrete.
- Ensure that figures illustrating the complementarity region and confidence distributions are clearly labeled with sample sizes and percentages for easy interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results and strengthen the discussion of our experimental setup. We address each major comment below.
Point-by-point responses
Referee: The +0.4 pp improvement (69.3% vs 68.9%) is presented as modest, but without reported statistical tests or confidence intervals in the main results, it is difficult to assess whether this difference is meaningful or could be due to sampling variability.
Authors: We agree that statistical tests and confidence intervals are important for interpreting the modest gains. In the revised manuscript, we have added 95% bootstrap confidence intervals for all accuracy metrics and applied McNemar's test to compare hybrid and AI-only performance. The test confirms the +0.4 pp difference is not statistically significant (p=0.62), supporting our characterization of the complementarity as limited. These additions appear in the results section and Table 1. (A generic sketch of this paired-comparison procedure follows these responses.) Revision: yes.
Referee: The claim that the primary bottleneck is not human task accuracy but routing and assistance design is load-bearing for the paper's recommendations. However, this depends on the 1,886 samples and human participants representing realistic oversight scenarios for advanced AI; more details on participant selection, expertise, and task representativeness would strengthen this.
Authors: We have expanded the Methods section with additional details on participant recruitment (Prolific platform with screening for English proficiency and basic familiarity with AI tools), self-reported expertise (average 3.2/5 on AI interaction), and task design rationale (selected domains to approximate real-world oversight tasks like fact-checking and long-context analysis). We also added a limitations paragraph noting that our participant pool and tasks, while diverse, may not fully generalize to expert oversight of frontier models. These changes support the bottleneck claim while acknowledging scope constraints. Revision: yes.
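For concreteness, the paired comparison described in the first response can be reproduced generically with the McNemar implementation in `statsmodels` plus a percentile bootstrap over items; this is a sketch of the standard procedure, not the authors' code:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_comparison(hybrid_correct, ai_correct, n_boot=10_000, seed=0):
    """McNemar's test plus a bootstrap CI on the accuracy difference
    for two systems scored on the same items (boolean arrays)."""
    hybrid_correct = np.asarray(hybrid_correct, bool)
    ai_correct = np.asarray(ai_correct, bool)

    # 2x2 table of agreement/disagreement between the two systems.
    table = [
        [np.sum(hybrid_correct & ai_correct), np.sum(hybrid_correct & ~ai_correct)],
        [np.sum(~hybrid_correct & ai_correct), np.sum(~hybrid_correct & ~ai_correct)],
    ]
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs

    # Percentile bootstrap CI for the accuracy difference (resample items).
    rng = np.random.default_rng(seed)
    n = len(ai_correct)
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = hybrid_correct[idx].mean(axis=1) - ai_correct[idx].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return result.pvalue, (lo, hi)
```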
Circularity Check
No circularity: purely empirical measurements on a fixed dataset.
Full rationale
The paper is an empirical evaluation of human-AI hybridization and assistance methods (top-2, subtask delegation) on a pre-collected 1,886-sample multi-domain dataset. All reported figures (+0.4 pp gain, 8.9% complementarity region, confidence distributions, accuracy lifts from 28.4% to 38.3%) are direct counts and percentages computed from observed human and model responses. No equations, parameter fitting, derivations, or predictive models are present. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. Results are self-contained direct measurements against external benchmarks (human and AI performance on the items), satisfying the criterion for score 0.
Axiom & Free-Parameter Ledger
None: the paper fits no parameters and invokes no axioms; all reported figures are direct counts and percentages computed from observed human and model responses.