Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

Vivienne Ming

arXiv: 2607.02467 · v1 · pith:H2UFWQPCnew · submitted 2026-07-02 · cs.CY · cs.AI

Pith reviewed 2026-07-03 05:35 UTC · model grok-4.3

keywords: human-AI collaboration, forecasting, prediction markets, hybrid intelligence, collaborative traits, perspective-taking, intellectual humility

The pith

Hybrid forecasting with AI is trimodal: most people match or fall below the model, but those high in perspective-taking, humility, and curiosity reach or beat market accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses real-money prediction market outcomes as an objective benchmark to show that human-AI collaboration does not produce a uniform average effect. Instead, individual results cluster into three modes: most forecasters either defer to the model or use it only to confirm their own view, while a smaller group integrates the model through genuine complementary reasoning. This top mode reaches accuracy at or above the market itself. The distinguishing factor is not cognitive ability or model performance but specific collaborative traits measured in the participants.

Core claim

Analyzed at the level of the individual forecaster on Polymarket, hybrid performance is trimodal: most either deferred to the model or rubber-stamped a prior guess, while a minority engaged in genuine complementary reasoning and reached accuracy matching or exceeding the market. Collaborative traits (perspective-taking, intellectual humility, and curiosity), rather than raw cognitive ability or model benchmarks, distinguished who reached that mode.
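
The abstract does not give the thresholds used to assign forecasters to modes, so the following Python sketch is only one plausible operationalization. It assumes forecasts are scored with the Brier score (mean squared error against resolved 0/1 outcomes) and compares each forecaster's hybrid error with the model's and the market's. The function names, the `tol` tolerance, and the `unclassified` fallback are assumptions made for illustration, not the paper's method.

```python
from statistics import mean


def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def classify_mode(hybrid, model, market, outcomes, tol=0.01):
    """Assign one forecaster to an illustrative performance mode.

    `tol` is a hypothetical tolerance on Brier-score differences;
    the paper's actual thresholds are not reported in the abstract.
    """
    h = brier(hybrid, outcomes)   # forecaster working with the model
    m = brier(model, outcomes)    # model alone
    k = brier(market, outcomes)   # market prices
    if h <= k + tol:
        return "complementary"    # error at or below the market's
    if abs(h - m) <= tol:
        return "defer"            # error roughly equal to the model's
    if h > m + tol:
        return "rubber-stamp"     # error worse than the model alone
    return "unclassified"         # beats the model but not the market


# Toy values for illustration only, not data from the paper.
outcomes = [1, 0, 1, 1]
model = [0.7, 0.3, 0.6, 0.6]
market = [0.8, 0.2, 0.8, 0.8]
print(classify_mode([0.9, 0.1, 0.9, 0.8], model, market, outcomes))
```

Checking the market comparison first means a forecaster who matches a model that itself beats the market is counted in the top mode; a replication would need to state which rule it uses.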

What carries the argument

The trimodal distribution of hybrid performance outcomes, where the high-performing mode is identified by the presence of collaborative traits that enable complementary reasoning with the model.

If this is right

Selection for collaborative traits becomes necessary for effective human-AI teams in forecasting tasks.
Model benchmarks alone cannot predict whether pairing a person with AI will improve, match, or degrade accuracy.
Training programs targeting perspective-taking, intellectual humility, and curiosity could shift more individuals into the high-performing hybrid mode.
Prediction markets offer an externally resolved way to measure genuine complementary reasoning rather than self-reported collaboration quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the trimodal pattern holds, organizations deploying AI forecasting tools may need to screen or train users rather than assume uniform gains from the technology.
The same trait-based differences could appear in other high-stakes domains where humans and models must integrate information, such as medical diagnosis or investment decisions.
AI interfaces might be redesigned to prompt the specific collaborative behaviors that distinguish the top mode.

Load-bearing premise

The small pilot sample and trait measurements reliably identify the trimodal pattern and its predictors without selection bias or post-hoc identification of traits.

What would settle it

A pre-registered replication study with a larger sample that finds either no trimodal pattern or that cognitive ability or model benchmarks predict hybrid performance as well as or better than the three collaborative traits.
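
One way a replication might compare predictors is by rank-based discrimination: for each candidate predictor, estimate the probability that a randomly chosen top-mode forecaster scores higher than a randomly chosen forecaster from the other modes (the AUC). This is a sketch under assumptions; the paper does not say which comparison it uses, and the variable names and toy values below are hypothetical.

```python
def auc(scores, in_top_mode):
    """Probability that a top-mode forecaster outscores another forecaster.

    Ties count as half a win. `in_top_mode` holds truthy values for
    forecasters assigned to the high-performing mode.
    """
    pos = [s for s, y in zip(scores, in_top_mode) if y]
    neg = [s for s, y in zip(scores, in_top_mode) if not y]
    if not pos or not neg:
        raise ValueError("need at least one forecaster in each group")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only, not data from the paper.
top_mode = [1, 1, 0, 0, 0, 0]
trait_score = [5, 4, 2, 1, 3, 2]       # hypothetical collaborative-trait scale
cognitive_score = [3, 5, 4, 2, 5, 1]   # hypothetical cognitive-ability scale
print(auc(trait_score, top_mode), auc(cognitive_score, top_mode))
```

Under this framing, the claim would be undermined if the AUC for cognitive ability or a model benchmark were as high as, or higher than, the AUC for the collaborative traits in a larger pre-registered sample.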

Original abstract

Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human capital. Analyzed at the level of the individual forecaster, hybrid performance is trimodal: most people either deferred to the model (matching it) or used it to rubber-stamp a prior guess (performing worse than the model alone), while a minority engaged in genuine complementary reasoning and reached accuracy matching or even exceeding (i.e., lower error than) the market itself. Collaborative traits (perspective-taking, intellectual humility, and curiosity), rather than raw cognitive ability or model benchmarks, distinguished who reached that mode. The results are preliminary but statistically robust, and motivate a pre-registered replication now in preparation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Pilot claims trimodal hybrid forecasting performance on Polymarket driven by collaborative traits over cognitive ability or model quality, but provides no sample size, stats, or measurement details to check the pattern.

The main takeaway is that this pilot finds hybrid human-AI forecasting splits into three individual-level modes (deferring to the model, rubber-stamping a prior view, or genuine complementary reasoning that matches or beats the market), and that traits like perspective-taking, intellectual humility, and curiosity separate the top group while cognitive ability and model benchmarks do not.

What is new is the trimodal split itself and the claim that those collaborative traits outperform other predictors in a real prediction market. The paper does well by using Polymarket outcomes as an objective, externally resolved benchmark instead of lab tasks, and by moving past average effects to look at who benefits from the pairing.

The soft spots are mostly around missing information. The abstract calls the results statistically robust but gives no sample size, no trait measurement protocol, no definition of the three modes, and no test statistics or error bars. Without those, it is hard to tell whether the clean pattern survives small-N issues or multiple-comparison problems. The post-hoc risk on trait selection is also real in a pilot, even if a pre-registered replication is planned.

This is for researchers working on human-AI team design in forecasting, finance, or policy who care about measurable human factors. A reader looking for testable ideas on complementary skills would get something from it, though they would need the methods or the follow-up to use the result.

It deserves a serious referee because the core observation is falsifiable and the authors are open about the preliminary stage. I would recommend sending it to peer review rather than desk rejecting it.

Referee Report

3 major / 1 minor

Summary. The manuscript reports a pilot study of individual forecasters collaborating with AI models, scored against a real-money prediction market (Polymarket) as an objective benchmark. It claims hybrid performance is trimodal: most forecasters either defer to the model (matching it) or rubber-stamp prior beliefs (performing worse than the model alone), while a minority engages in genuine complementary reasoning that matches or exceeds market accuracy. Membership in the high-performing mode is predicted by collaborative human-capital traits (perspective-taking, intellectual humility, curiosity) rather than raw cognitive ability or model benchmarks. The results are described as statistically robust despite being preliminary, motivating a pre-registered replication.

Significance. If the empirical distinctions hold after full methodological disclosure, the work would shift emphasis in hybrid intelligence research from aggregate model performance to measurable individual differences in collaborative traits. The use of an externally resolved, real-stakes benchmark strengthens the objective grounding of accuracy claims and could inform forecaster selection and interface design for human-AI teams.

major comments (3)

[Abstract] The assertion that 'the results are preliminary but statistically robust' supplies no sample size, trait measurement protocols, statistical tests, controls, or error bars, rendering it impossible to evaluate support for the trimodal pattern or the claimed superiority of collaborative traits over cognitive ability.
[Results] The operationalization of the three behavioral modes (defer, rubber-stamp, genuine complementary) is not described, including any quantitative thresholds for deviation from model outputs or market prices; this definition is load-bearing for the central trimodal claim and the subsequent trait-based partitioning.
[Methods] No information is provided on whether the collaborative trait scales were pre-specified or identified after inspecting accuracy outcomes; without this, the reported distinction between collaborative traits and cognitive ability is vulnerable to post-hoc selection or capitalization on chance in a small pilot sample.
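
To make the chance-capitalization concern concrete: if several candidate traits are each tested against mode membership, the resulting p-values can be adjusted for multiple comparisons. The sketch below uses Holm's step-down procedure, a standard correction chosen here for illustration; the manuscript does not say whether any correction was applied, and the p-values shown are invented.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans aligned with `p_values`: True where the
    hypothesis is rejected while controlling the family-wise error rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Invented p-values for four hypothetical trait tests.
print(holm_reject([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

Two of the four invented tests would pass an uncorrected 0.05 cutoff but fail after correction, which is the pattern a referee would want ruled out in a small pilot.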

minor comments (1)

[Abstract] The abstract states that a replication is 'now in preparation' but provides no timeline or registration identifier; adding these details would strengthen the forward-looking claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and commit to revisions that improve clarity without altering the core claims of this pilot study.

Referee: [Abstract] The assertion that 'the results are preliminary but statistically robust' supplies no sample size, trait measurement protocols, statistical tests, controls, or error bars, rendering it impossible to evaluate support for the trimodal pattern or the claimed superiority of collaborative traits over cognitive ability.

Authors: We agree that the abstract is insufficiently detailed for independent evaluation. In revision we will expand it to report the sample size, name the specific trait scales (perspective-taking, intellectual humility, curiosity), reference the statistical tests used for the trimodal partitioning and trait comparisons, and note that results include confidence intervals or error bars. Word-count constraints will keep some protocol details in the main text.

Revision: yes

Referee: [Results] The operationalization of the three behavioral modes (defer, rubber-stamp, genuine complementary) is not described, including any quantitative thresholds for deviation from model outputs or market prices; this definition is load-bearing for the central trimodal claim and the subsequent trait-based partitioning.

Authors: The current manuscript does not supply the quantitative criteria used to assign forecasters to the three modes. We will add an explicit operationalization subsection (likely in Methods or Results) that defines the deviation thresholds from model outputs and market prices, the classification rules, and any robustness checks. This will make the trimodal structure fully reproducible from the data.

Revision: yes

Referee: [Methods] No information is provided on whether the collaborative trait scales were pre-specified or identified after inspecting accuracy outcomes; without this, the reported distinction between collaborative traits and cognitive ability is vulnerable to post-hoc selection or capitalization on chance in a small pilot sample.

Authors: The scales were drawn from established instruments in the collaborative-intelligence literature and selected on theoretical grounds before data inspection. However, the mode-partitioning procedure itself was exploratory. We will revise the Methods section to state the a-priori rationale for trait selection, explicitly label the mode identification as data-informed, report all tested variables, and underscore the planned pre-registered replication to mitigate concerns about chance capitalization.

Revision: yes
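
The confidence intervals promised in the first response could be obtained in several ways. As one hedged example, a percentile bootstrap over per-forecaster error differences (hybrid error minus model error) gives an interval without distributional assumptions. The function name, defaults, and sample values below are hypothetical; the rebuttal does not specify a method.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented per-forecaster differences in error (hybrid minus model).
diffs = [-0.05, 0.02, -0.10, 0.01, -0.03]
low, high = bootstrap_ci(diffs)
print(f"95% CI for mean difference: [{low:.3f}, {high:.3f}]")
```

With a sample as small as a pilot's, such an interval will be wide, which is itself useful information for readers judging the "statistically robust" claim.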

Circularity Check

0 steps flagged

No circularity: purely empirical pilot with no derivations or self-referential steps

The manuscript is an empirical pilot study reporting observed patterns in a real-money prediction market (Polymarket). No equations, fitted parameters, mathematical derivations, or ansatzes appear in the abstract or described methods. Claims about trimodal performance modes and trait predictors rest on data analysis rather than any reduction to inputs by construction. No self-citations are referenced as load-bearing for uniqueness or definitions. The result is self-contained against the external market benchmark and does not meet any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Full manuscript text not available; abstract contains no free parameters, axioms, or invented entities.
