VCBench: Benchmarking LLMs in Venture Capital

Aaron Ontoyin Yin; Afriyie Samuel Kwesi; Ben Griffin; Fuat Alican; Joseph Ternasky; Kelvin Amoaba; Rick Chen; Xianling Mu; Yigit Ihlamur; Zakari Salifu

arxiv: 2509.14448 · v2 · submitted 2025-09-17 · 💻 cs.AI

Pith reviewed 2026-05-18 15:24 UTC · model grok-4.3

classification 💻 cs.AI

keywords: VCBench, LLM benchmarking, venture capital, founder success, anonymized data, precision evaluation

The pith

On VCBench, most evaluated LLMs predict founder success more precisely than the market index and human investor benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VCBench is introduced as the first shared benchmark for assessing how well large language models can forecast which early-stage founders will succeed. The benchmark uses 9,000 anonymized profiles that are standardized to keep key predictive signals, and adversarial tests show a reduction in re-identification risk of more than 90 percent. Evaluations show that DeepSeek-V3 achieves over six times the baseline precision (the market index stands at 1.9 percent), GPT-4o achieves the highest F0.5 score, and most models exceed the human benchmarks. The resource is public, to encourage ongoing community evaluation and improvement of AI applications in high-uncertainty domains like venture capital.

Core claim

The central discovery is that by providing a large set of standardized and anonymized founder profiles, VCBench enables LLMs to demonstrate predictive capabilities in venture capital that surpass both simple market indices and human benchmarks, with specific models showing substantial gains in precision metrics.

What carries the argument

The VCBench dataset: 9,000 anonymized founder profiles, standardized to retain success-predicting features while minimizing identity leakage.

If this is right

If correct, LLMs could become reliable tools for initial screening in venture capital.
The benchmark sets a standard that future models must meet or exceed to claim superiority in this task.
Privacy-preserving data sharing methods are validated for use in similar sensitive prediction problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This suggests potential for AI to democratize access to investment insights beyond elite firms.
Similar benchmarks could be developed for other domains with uncertain outcomes such as talent scouting in sports or academia.
The results raise questions about what specific features in founder profiles the models are using that humans might overlook.

Load-bearing premise

Anonymization of the founder profiles does not remove the information needed to predict real success outcomes.

What would settle it

A longitudinal study that follows the actual funding and exit outcomes of the profiled founders and measures correlation with the LLM predictions.

Figures

Figures reproduced from arXiv: 2509.14448 by Aaron Ontoyin Yin, Afriyie Samuel Kwesi, Ben Griffin, Fuat Alican, Joseph Ternasky, Kelvin Amoaba, Rick Chen, Xianling Mu, Yigit Ihlamur, Zakari Salifu.

**Figure 1.** Predictive performances of nine vanilla LLMs on VCBench, with human-level baselines. The human-level baseline results are scaled linearly to reflect the inflation of success rate from the real world (1.9%) to VCBench (9%).

**Figure 2.** Data Cleaning Pipeline

| Record Type | Original No. unique entries | Final No. unique entries | Percentage Reduction |
| --- | --- | --- | --- |
| industry | 314 | 61 | 80.6% |
| education degree | 2155 | 404 | 81.3% |
| education field of study | 6360 | 3969 | 37.6% |
| job role | 21259 | 16374 | 23.0% |
| education record | 20573 | 15620 | 24.1% |
| job record | 45975 | 41183 | 10.4% |
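The percentage reductions listed with Figure 2 can be checked directly from the original and final counts. The sketch below assumes the reduction is defined as (original - final) / original, which is consistent with every reported row; the dictionary and function names are illustrative and do not come from the paper.

```python
# Unique-entry counts before and after cleaning, copied from the Figure 2 table.
COUNTS = {
    "industry": (314, 61),
    "education degree": (2155, 404),
    "education field of study": (6360, 3969),
    "job role": (21259, 16374),
    "education record": (20573, 15620),
    "job record": (45975, 41183),
}

def percentage_reduction(original: int, final: int) -> float:
    """Relative reduction in unique entries, in percent (assumed definition)."""
    return 100.0 * (original - final) / original

# Rounded to one decimal place, as in the table.
REDUCTIONS = {
    name: round(percentage_reduction(orig, final), 1)
    for name, (orig, final) in COUNTS.items()
}

for name, pct in REDUCTIONS.items():
    print(f"{name}: {pct}%")
```

The largest consolidation happens in the low-cardinality categorical fields (industry, education degree), while full job and education records shrink far less.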

**Figure 3.** Distribution of industries in VCBench after bucketing

**Figure 4.** Distribution of startup founding years in VCBench.

Original abstract

Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.
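To make the human baselines concrete, the multiples in the abstract can be converted into implied precision values. This is a rough sketch: the 1.9% index precision and the 1.7x and 2.9x multiples come from the abstract, and the 9% VCBench success rate comes from the Figure 1 caption. Treating "scaled linearly" as multiplication by 9% / 1.9% is an assumption, and the function names are illustrative.

```python
MARKET_INDEX_PRECISION = 0.019  # 1.9%, stated in the abstract
VCBENCH_SUCCESS_RATE = 0.09     # 9%, stated in the Figure 1 caption

# Multiples over the market index, as stated in the abstract.
MULTIPLES = {"Y Combinator": 1.7, "tier-1 firms": 2.9}

def implied_precision(multiple: float, base: float = MARKET_INDEX_PRECISION) -> float:
    """Real-world precision implied by a multiple of the market index."""
    return multiple * base

def scale_to_vcbench(real_world_precision: float) -> float:
    """Assumed linear scaling: multiply by the ratio of the two base rates."""
    return real_world_precision * VCBENCH_SUCCESS_RATE / MARKET_INDEX_PRECISION

for name, multiple in MULTIPLES.items():
    p = implied_precision(multiple)
    print(f"{name}: {p:.2%} real-world, {scale_to_vcbench(p):.1%} on the VCBench scale")
```

Under these assumptions, Y Combinator corresponds to roughly 3.2% real-world precision and tier-1 firms to roughly 5.5%, and both are stretched by the same factor when mapped onto the benchmark's higher base rate.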

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VCBench, the first benchmark for LLM-based prediction of founder success in venture capital. It provides 9,000 anonymized founder profiles that are standardized to preserve predictive features, with adversarial tests showing a reduction in re-identification risk of over 90%. The market index baseline achieves 1.9% precision; Y Combinator reaches 1.7x that precision and tier-1 firms reach 2.9x. Evaluation of nine state-of-the-art LLMs shows DeepSeek-V3 delivering over 6x baseline precision, GPT-4o achieving the highest F0.5, and most models surpassing human benchmarks. The benchmark is positioned as a public, evolving resource at vcbench.com for reproducible, privacy-preserving evaluation of AGI in early-stage venture forecasting.

Significance. If the central performance claims hold after addressing verification gaps, VCBench would provide a valuable public benchmark for sparse-signal, high-uncertainty forecasting tasks, analogous to SWE-bench or ARC-AGI. The combination of a standardized dataset, explicit privacy guarantees, and direct comparison to market index, top accelerators, and human performance offers a reproducible testbed that could accelerate research on LLM capabilities in real-world decision domains with asymmetric information.

major comments (2)

Abstract: The assertion that standardization of the 9,000 profiles 'preserve[s] predictive features' while achieving >90% re-identification risk reduction lacks any supporting ablation, expert re-labeling, or direct performance comparison between original and anonymized versions. This is load-bearing for the central claim that LLMs achieve 6x baseline precision and surpass human benchmarks, as anonymization could remove or alter sparse signals (e.g., specific prior exits or network indicators) that drive the reported gains.
Abstract / Evaluation protocol: No details are provided on label accuracy for founder success outcomes, exact data sourcing methodology, statistical significance tests, or the precise prompting and scoring procedure used for the nine LLMs. These omissions make it impossible to verify that the reported precision and F0.5 improvements are robust rather than artifacts of evaluation choices.

minor comments (2)

Abstract: The term 'F0.5' is used without a brief definition or reference to its weighting (precision-oriented F-beta), which may confuse readers unfamiliar with the metric in this context.
Abstract: The claim that 'most models surpass human benchmarks' would benefit from a specific citation or table row identifying the human baseline performance numbers for direct comparison.
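For readers unfamiliar with the metric raised in the first minor comment, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below uses the standard F-beta definition; the function name and the example precision and recall values are illustrative and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 favours precision, beta > 1 favours recall, beta = 1 gives F1.
    Returns 0.0 when the denominator is zero.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# Illustrative values only: a high-precision, low-recall predictor.
p, r = 1.0, 0.5
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # precision-weighted
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")  # balanced
```

With the same precision and recall, F0.5 comes out higher than F1 whenever precision exceeds recall, which is why the metric suits a setting where false positives (bad investments) cost more than missed opportunities.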

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where greater methodological transparency would strengthen the paper. We address each major comment below and commit to revisions that improve verifiability without altering the core claims.

Referee: Abstract: The assertion that standardization of the 9,000 profiles 'preserve[s] predictive features' while achieving >90% re-identification risk reduction lacks any supporting ablation, expert re-labeling, or direct performance comparison between original and anonymized versions. This is load-bearing for the central claim that LLMs achieve 6x baseline precision and surpass human benchmarks, as anonymization could remove or alter sparse signals (e.g., specific prior exits or network indicators) that drive the reported gains.

Authors: We acknowledge that the manuscript does not present an explicit ablation comparing LLM and baseline performance on the original versus standardized profiles. The anonymization pipeline was developed with input from domain experts to retain structured fields such as education history, prior roles, and network indicators while stripping direct identifiers, and adversarial re-identification tests were run to measure privacy gains. In the revised manuscript we will add a dedicated subsection with (i) a description of the expert consultation process, (ii) a small-scale performance comparison on a held-out subset of profiles before and after standardization, and (iii) quantitative evidence that the key sparse signals remain intact. This will directly address the concern that anonymization may have inadvertently removed predictive information. revision: yes
Referee: Abstract / Evaluation protocol: No details are provided on label accuracy for founder success outcomes, exact data sourcing methodology, statistical significance tests, or the precise prompting and scoring procedure used for the nine LLMs. These omissions make it impossible to verify that the reported precision and F0.5 improvements are robust rather than artifacts of evaluation choices.

Authors: We agree that these elements are necessary for independent verification. Founder-success labels were obtained from public company registries, funding announcements, and exit databases with cross-validation across sources. In the revision we will expand the Evaluation Protocol section to include: exact data sources and collection dates; a manual audit of label accuracy on a random sample with reported agreement rate; bootstrap confidence intervals and paired statistical tests for all reported metrics; and the complete prompting templates, decoding parameters, and scoring rubric applied to each of the nine LLMs. These additions will allow readers to reproduce and assess the robustness of the precision and F0.5 results. revision: yes
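The bootstrap confidence intervals promised in the rebuttal could look roughly like the following percentile bootstrap for precision. This is a minimal sketch, not the authors' procedure: the data are synthetic (1,000 profiles with a 9% success rate, 100 flagged by a hypothetical model, 30 of them correct), and all names and parameters are assumptions.

```python
import random

def precision(preds, labels):
    """True positives divided by predicted positives (0.0 if none predicted)."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    predicted_pos = sum(1 for p in preds if p)
    return tp / predicted_pos if predicted_pos else 0.0

def bootstrap_precision_ci(preds, labels, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample profiles with replacement, recompute precision."""
    rng = random.Random(seed)
    n = len(preds)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(precision([preds[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Synthetic data (hypothetical): 90 successes among 1,000 profiles.
labels = [1] * 90 + [0] * 910
# Hypothetical model: flags 30 of the successes and 70 of the non-successes.
preds = [1] * 30 + [0] * 60 + [1] * 70 + [0] * 840

point = precision(preds, labels)
lo, hi = bootstrap_precision_ci(preds, labels)
print(f"precision = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Because only about 100 profiles are flagged in this toy setup, the interval is wide. That is the practical reason such intervals matter when comparing models whose precision differs by a few percentage points.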

Circularity Check

0 steps flagged

No significant circularity: external benchmark evaluated against independent indices

The paper introduces VCBench as a new dataset of 9,000 anonymized founder profiles and reports empirical LLM performance metrics (e.g., DeepSeek-V3 achieving >6x baseline precision) against an external market index (1.9% precision) and human benchmarks. No derivation chain reduces results to fitted parameters by construction, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz or renaming of known results is presented as a first-principles derivation. The standardization claim is an empirical assertion supported by adversarial re-identification tests rather than a self-referential definition of success or predictive power.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the collected profiles are representative of real VC founder outcomes and that anonymization does not remove predictive signal. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Anonymized founder profiles retain sufficient predictive information for success while protecting identity.
Stated directly in the abstract as the basis for the benchmark construction.

pith-pipeline@v0.9.0 · 5762 in / 1228 out tokens · 58491 ms · 2026-05-18T15:24:14.737742+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs)."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We develop a generalizable pipeline for data cleaning and anonymization, and validate it with adversarial re-identification tests."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches
cs.LG 2026-04 accept novelty 6.0

YC Bench is a new live benchmark that evaluates forecasting models for startup outperformance within YC batches using a short-term Pre-Demo Day Score derived from public traction signals.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[2] VQA: Visual Question Answering
Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Batra, D., and Parikh, D. VQA: Visual question answering, 2016. URL https://arxiv.org/abs/1505.00468

[3] Assessment of machine learning performance for decision support in venture capital investments
Arroyo, J., Corea, F., Jiménez-Díaz, G., and Recio-García, J. Assessment of machine learning performance for decision support in venture capital investments. IEEE Access, PP:1–1, August 2019. doi:10.1109/ACCESS.2019.2938659

[4] On the Measure of Intelligence
Chollet, F. On the measure of intelligence, 2019. URL https://arxiv.org/abs/1911.01547

[5] ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Chollet, F., Knoop, M., Kamradt, G., Landers, B., and Pinkard, H. ARC-AGI-2: A new challenge for frontier AI reasoning systems, 2025. URL https://arxiv.org/abs/2505.11831

[6] Finding the unicorn: Predicting early stage startup success through a hybrid intelligence method
Dellermann, D., Lipusch, N., Ebel, P., Popp, K. M., and Leimeister, J. M. Finding the unicorn: Predicting early stage startup success through a hybrid intelligence method, 2021. URL https://arxiv.org/abs/2105.03360

[7] Random rule forest (RRF): Interpretable ensembles of LLM-generated questions for predicting startup success
Griffin, B., Ternasky, J., Alican, F., and Ihlamur, Y. Random rule forest (RRF): Interpretable ensembles of LLM-generated questions for predicting startup success, 2025. URL https://arxiv.org/abs/2505.24622

[8] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770

[9] A fused large language model for predicting startup success
Maarouf, A., Feuerriegel, S., and Pröllochs, N. A fused large language model for predicting startup success. European Journal of Operational Research, 322(1):198–214, April 2025. ISSN 0377-2217. doi:10.1016/j.ejor.2024.09.011. URL http://dx.doi.org/10.1016/j.ejor.2024.09.011

[10] Policy induction: Predicting startup success via explainable memory-augmented in-context learning
Mu, X., Ternasky, J., Alican, F., and Ihlamur, Y. Policy induction: Predicting startup success via explainable memory-augmented in-context learning, 2025. URL https://arxiv.org/abs/2505.21427

[11] Sequential diagnosis with language models
Nori, H., Daswani, M., Kelly, C., Lundberg, S., Ribeiro, M. T., Wilson, M., Liu, X., Sounderajah, V., Carlson, J., Lungren, M. P., Gross, B., Hames, P., Suleyman, M., King, D., and Horvitz, E. Sequential diagnosis with language models, 2025. URL https://arxiv.org/abs/2506.22405

[12] Startup success prediction and VC portfolio simulation using Crunchbase data
Potanin, M., Chertok, A., Zorin, K., and Shtabtsovsky, C. Startup success prediction and VC portfolio simulation using Crunchbase data, 2023. URL https://arxiv.org/abs/2309.15552

[13] Predicting the success of startups using a machine learning approach
Razaghzadeh Bidgoli, M., Raeesi Vanani, I., and Goodarzi, M. Predicting the success of startups using a machine learning approach. Journal of Innovation and Entrepreneurship, 13:80, 2024. doi:10.1186/s13731-024-00436-x. URL https://doi.org/10.1186/s13731-024-00436-x

[14] ImageNet Large Scale Visual Recognition Challenge
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge, 2015. URL https://arxiv.org/abs/1409.0575
