VCBench: Benchmarking LLMs in Venture Capital
Pith reviewed 2026-05-18 15:24 UTC · model grok-4.3
The pith
Most evaluated LLMs predict founder success in venture capital better than the market index and human benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VCBench supplies a large set of standardized, anonymized founder profiles on which LLMs show predictive performance in venture capital that exceeds both the market index and human benchmarks: DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass the human benchmarks.
What carries the argument
VCBench dataset consisting of 9,000 anonymized founder profiles standardized to retain success-predicting features while minimizing identity leakage.
If this is right
- LLMs could become reliable tools for initial screening in venture capital.
- The benchmark sets a standard that future models must meet or exceed to claim superiority in this task.
- Privacy-preserving data sharing methods are validated for use in similar sensitive prediction problems.
Where Pith is reading between the lines
- This suggests potential for AI to democratize access to investment insights beyond elite firms.
- Similar benchmarks could be developed for other domains with uncertain outcomes such as talent scouting in sports or academia.
- The results raise questions about what specific features in founder profiles the models are using that humans might overlook.
Load-bearing premise
Anonymization of the founder profiles does not remove the information needed to predict real success outcomes.
What would settle it
A longitudinal study that follows the actual funding and exit outcomes of the profiled founders and measures correlation with the LLM predictions.
Original abstract
Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7x, while tier-1 firms are 2.9x better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.
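The multiples quoted in the abstract are easier to compare as absolute precisions. The following is a worked arithmetic sketch, assuming each quoted factor multiplies the 1.9% market-index precision directly. The names `BASELINE_PRECISION`, `multiples`, and `implied_precision` are illustrative and do not come from the paper, and the DeepSeek-V3 entry uses 6.0 only as a lower bound for "over six times".

```python
# Implied precision from the multiples quoted in the abstract.
# Assumption: each multiple applies directly to the market-index precision.

BASELINE_PRECISION = 0.019  # market index precision at inception (from the abstract)

# Multiples over the market index as stated in the abstract.
# The DeepSeek-V3 value is a lower bound ("over six times"), not an exact figure.
multiples = {
    "Y Combinator": 1.7,
    "Tier-1 firms": 2.9,
    "DeepSeek-V3 (lower bound)": 6.0,
}


def implied_precision(multiple: float, baseline: float = BASELINE_PRECISION) -> float:
    """Return the precision implied by a multiple of the baseline precision."""
    return multiple * baseline


implied = {name: implied_precision(m) for name, m in multiples.items()}

for name, p in implied.items():
    print(f"{name}: {p:.2%}")
# Y Combinator: 3.23%
# Tier-1 firms: 5.51%
# DeepSeek-V3 (lower bound): 11.40%
```

Under this reading, even the strongest reported result corresponds to roughly one correct pick in nine predicted successes, which shows how sparse the positive class is.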
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VCBench, the first benchmark for LLM-based prediction of founder success in venture capital. It provides 9,000 anonymized founder profiles, standardized to preserve predictive features, with adversarial tests showing a reduction of more than 90% in re-identification risk. The market index baseline achieves 1.9% precision; Y Combinator outperforms the index by 1.7x, and tier-1 firms outperform it by 2.9x. Evaluation of nine state-of-the-art LLMs shows DeepSeek-V3 delivering over 6x baseline precision, GPT-4o achieving the highest F0.5, and most models surpassing human benchmarks. The benchmark is positioned as a public, evolving resource at vcbench.com for reproducible, privacy-preserving evaluation of AGI in early-stage venture forecasting.
Significance. If the central performance claims hold after addressing verification gaps, VCBench would provide a valuable public benchmark for sparse-signal, high-uncertainty forecasting tasks, analogous to SWE-bench or ARC-AGI. The combination of a standardized dataset, explicit privacy guarantees, and direct comparison to market index, top accelerators, and human performance offers a reproducible testbed that could accelerate research on LLM capabilities in real-world decision domains with asymmetric information.
major comments (2)
- Abstract: The assertion that standardization of the 9,000 profiles 'preserve[s] predictive features' while achieving >90% re-identification risk reduction lacks any supporting ablation, expert re-labeling, or direct performance comparison between original and anonymized versions. This is load-bearing for the central claim that LLMs achieve 6x baseline precision and surpass human benchmarks, as anonymization could remove or alter sparse signals (e.g., specific prior exits or network indicators) that drive the reported gains.
- Abstract / Evaluation protocol: No details are provided on label accuracy for founder success outcomes, exact data sourcing methodology, statistical significance tests, or the precise prompting and scoring procedure used for the nine LLMs. These omissions make it impossible to verify that the reported precision and F0.5 improvements are robust rather than artifacts of evaluation choices.
minor comments (2)
- Abstract: The term 'F0.5' is used without a brief definition or reference to its weighting (precision-oriented F-beta), which may confuse readers unfamiliar with the metric in this context.
- Abstract: The claim that 'most models surpass human benchmarks' would benefit from a specific citation or table row identifying the human baseline performance numbers for direct comparison.
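For readers unfamiliar with the metric raised in the first minor comment, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below uses this standard definition. The precision and recall values are made up for illustration and are not results from the paper, and this page does not show how the paper itself computes the score.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0.0:
        return 0.0  # convention when both precision and recall are zero
    return (1.0 + beta_sq) * precision * recall / denominator


# Illustrative values only (not taken from the paper).
high_precision = f_beta(precision=0.20, recall=0.10)  # precise but low coverage
high_recall = f_beta(precision=0.10, recall=0.20)     # broad but less precise

print(round(high_precision, 4))  # 0.1667
print(round(high_recall, 4))     # 0.1111
```

The two cases have the same F1 score, yet F0.5 ranks the high-precision predictor higher, which matches a setting where false positives (bad investments) are costlier than missed opportunities.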
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where greater methodological transparency would strengthen the paper. We address each major comment below and commit to revisions that improve verifiability without altering the core claims.
- Referee: Abstract: The assertion that standardization of the 9,000 profiles 'preserve[s] predictive features' while achieving >90% re-identification risk reduction lacks any supporting ablation, expert re-labeling, or direct performance comparison between original and anonymized versions. This is load-bearing for the central claim that LLMs achieve 6x baseline precision and surpass human benchmarks, as anonymization could remove or alter sparse signals (e.g., specific prior exits or network indicators) that drive the reported gains.
Authors: We acknowledge that the manuscript does not present an explicit ablation comparing LLM and baseline performance on the original versus standardized profiles. The anonymization pipeline was developed with input from domain experts to retain structured fields such as education history, prior roles, and network indicators while stripping direct identifiers, and adversarial re-identification tests were run to measure privacy gains. In the revised manuscript we will add a dedicated subsection with (i) a description of the expert consultation process, (ii) a small-scale performance comparison on a held-out subset of profiles before and after standardization, and (iii) quantitative evidence that the key sparse signals remain intact. This will directly address the concern that anonymization may have inadvertently removed predictive information.
Revision: yes
- Referee: Abstract / Evaluation protocol: No details are provided on label accuracy for founder success outcomes, exact data sourcing methodology, statistical significance tests, or the precise prompting and scoring procedure used for the nine LLMs. These omissions make it impossible to verify that the reported precision and F0.5 improvements are robust rather than artifacts of evaluation choices.
Authors: We agree that these elements are necessary for independent verification. Founder-success labels were obtained from public company registries, funding announcements, and exit databases with cross-validation across sources. In the revision we will expand the Evaluation Protocol section to include: exact data sources and collection dates; a manual audit of label accuracy on a random sample with reported agreement rate; bootstrap confidence intervals and paired statistical tests for all reported metrics; and the complete prompting templates, decoding parameters, and scoring rubric applied to each of the nine LLMs. These additions will allow readers to reproduce and assess the robustness of the precision and F0.5 results.
Revision: yes
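The bootstrap confidence intervals promised in the rebuttal could look roughly like the following. This is a minimal sketch of a percentile bootstrap for precision, assuming binary success labels and binary model predictions. The function names and the synthetic data are hypothetical, and nothing here reflects the authors' actual procedure, which this page does not specify.

```python
import random


def precision(y_true, y_pred):
    """Precision = true positives / predicted positives; None if nothing was predicted positive."""
    predicted_pos = sum(y_pred)
    if predicted_pos == 0:
        return None
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / predicted_pos


def bootstrap_precision_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for precision, resampling profiles with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = precision([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value is not None:  # skip resamples with no predicted positives
            stats.append(value)
    if not stats:
        raise ValueError("no resample contained a predicted positive")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Synthetic example (not paper data): 100 profiles, 10 true successes,
# 20 predicted successes of which 5 are correct.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 5 + [0] * 5 + [1] * 15 + [0] * 75

point = precision(y_true, y_pred)
low, high = bootstrap_precision_ci(y_true, y_pred)
print(f"precision = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With a positive rate near 2%, intervals of this kind tend to be wide unless the evaluation set is large, which is why the referee's request for significance testing matters for the reported multiples.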
Circularity Check
No significant circularity: external benchmark evaluated against independent indices
The paper introduces VCBench as a new dataset of 9,000 anonymized founder profiles and reports empirical LLM performance metrics (e.g., DeepSeek-V3 achieving >6x baseline precision) against an external market index (1.9% precision) and human benchmarks. No derivation chain reduces results to fitted parameters by construction, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz or renaming of known results is presented as a first-principles derivation. The standardization claim is an empirical assertion supported by adversarial re-identification tests rather than a self-referential definition of success or predictive power.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Anonymized founder profiles retain sufficient predictive information for success while protecting identity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs).
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: We develop a generalizable pipeline for data cleaning and anonymization, and validate it with adversarial re-identification tests.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches
YC Bench is a new live benchmark that evaluates forecasting models for startup outperformance within YC batches using a short-term Pre-Demo Day Score derived from public traction signals.