Recognition: 2 theorem links · Lean Theorem
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Pith reviewed 2026-05-15 15:56 UTC · model grok-4.3
The pith
AI models reach at most 61.8 percent success when building complete web applications from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vibe Code Bench evaluates 16 frontier AI models on generating end-to-end web applications from 100 specifications. The best model achieves 61.8 percent accuracy on the test split when applications are checked by an autonomous browser agent executing 964 workflows with 10,131 substeps. Self-testing during generation correlates strongly with success at Pearson r=0.72, while human alignment studies show evaluator choice shifts pairwise agreement from 31.8 to 93.6 percent.
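As a rough formalization of the two headline statistics (notation ours, not the paper's; the paper may define accuracy over workflows rather than applications, and the correlation may be computed per model or per run):

```latex
% One plausible reading of the reported numbers, under the assumptions above.
% Accuracy of model m over the N evaluated units, where pass(m,i) = 1 iff the
% browser agent judges unit i fully satisfied:
\mathrm{Acc}(m) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{pass}(m, i)

% Pearson correlation between a self-testing measure x and a success measure y
% (the paper reports r = 0.72):
r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}
        {\sqrt{\sum_k (x_k - \bar{x})^2}\,\sqrt{\sum_k (y_k - \bar{y})^2}}
```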
What carries the argument
The autonomous browser agent, which runs the 964 defined workflows (10,131 substeps in total) against each deployed application to verify that it satisfies the original specification.
Load-bearing premise
The autonomous browser agent and defined workflows fully and accurately determine whether a generated application satisfies the original specification without missing functionality or requiring additional human judgment.
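To make this premise concrete, here is a minimal sketch of the kind of check the pipeline must perform. Every name in it (Substep, Workflow, and the agent's open, act, and judge methods) is hypothetical and not taken from the paper, which does not publish its agent interface in this summary.

```python
from dataclasses import dataclass

@dataclass
class Substep:
    instruction: str   # e.g. "click the 'New Post' button"
    expected: str      # natural-language success criterion

@dataclass
class Workflow:
    name: str
    substeps: list[Substep]

def run_substep(agent, step: Substep) -> bool:
    """Perform one substep with a browser agent and judge the outcome.

    In the benchmark this judgment is made by the autonomous agent itself,
    which is exactly what the load-bearing premise assumes is reliable.
    """
    observation = agent.act(step.instruction)       # hypothetical agent API
    return agent.judge(observation, step.expected)  # hypothetical agent API

def evaluate_app(agent, app_url: str, workflows: list[Workflow]) -> float:
    """Fraction of workflows whose every substep passes for one deployed app."""
    agent.open(app_url)                             # hypothetical agent API
    passed = sum(
        all(run_substep(agent, s) for s in wf.substeps) for wf in workflows
    )
    return passed / len(workflows)
```

If the judging step systematically misses or miscounts functionality, every downstream number, including the 61.8 percent headline, inherits that bias.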
What would settle it
A model or method that achieves over 90 percent accuracy on the held-out test split when evaluated by the same autonomous browser agent and workflows.
read the original abstract
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows and 10,131 substeps. It evaluates 16 frontier models on end-to-end development of deployed applications using an autonomous browser agent, reporting a best-model test accuracy of 61.8%, a Pearson r=0.72 correlation between self-testing during generation and performance, and a human alignment study showing 31.8-93.6% pairwise step-level agreement. Contributions include the dataset/pipeline, model evaluation with cost/latency/error analysis, and the alignment protocol.
Significance. If the evaluation pipeline proves robust, the benchmark would fill a gap by measuring complete zero-to-one web application development rather than isolated tasks, providing concrete evidence that reliable end-to-end generation remains challenging even for frontier models. The self-testing correlation and human alignment results could inform future agent designs and evaluation practices.
major comments (3)
- [Human alignment study and evaluation pipeline] The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality. (A minimal sketch of both agreement statistics follows this list of comments.)
- [Abstract and § on benchmark construction] The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.
- [Evaluation methodology] The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.
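For reference, both statistics named in the first major comment are standard and cheap to compute once step-level labels are tabulated. The sketch below uses invented binary pass/fail labels and a toy panel of three evaluators; it is not the paper's annotation scheme.

```python
from itertools import combinations

def pairwise_agreement(labels_a: list[int], labels_b: list[int]) -> float:
    """Fraction of steps on which two evaluators assign the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def fleiss_kappa(ratings: list[list[int]], n_categories: int) -> float:
    """Fleiss' kappa for ratings[i][r] = category chosen by rater r on step i.

    Assumes every step is labeled by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Per-step category counts.
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # Observed agreement averaged over steps.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[c] for row in counts) / (n_items * n_raters)
           for c in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy usage: 3 evaluators labeling 4 substeps as pass (1) / fail (0).
ratings = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 0, 0]]
print(fleiss_kappa(ratings, n_categories=2))
for a, b in combinations(range(3), 2):
    col_a = [row[a] for row in ratings]
    col_b = [row[b] for row in ratings]
    print(a, b, pairwise_agreement(col_a, col_b))
```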
minor comments (2)
- [Abstract] The abstract refers to a 'completed human alignment study', but the main text should explicitly state the number of human annotators, their expertise, and the exact protocol for step-level labeling.
- [Results section] Cost and latency results are mentioned but not tied to specific model identifiers or hardware; a table linking these metrics to the 16 models would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on Vibe Code Bench. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.
read point-by-point responses
-
Referee: The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality.
Authors: We agree that the alignment study reveals meaningful variability in human judgments, which we included precisely to illustrate the difficulty of evaluating complex, deployed web applications. In the revised manuscript we have added: (1) a case-by-case breakdown of the disagreement instances and their distribution across models, (2) the effect of those disagreements on final model rankings, and (3) Fleiss' kappa computed over the full set of step-level annotations. We also performed the requested sensitivity analysis using human-majority labels and alternative agent thresholds; model rankings remain stable, supporting the robustness of the reported 61.8% figure and the conclusion that end-to-end development remains challenging. revision: yes
-
Referee: The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.
Authors: We have substantially expanded both the abstract and Section 3 (Benchmark Construction) to describe the curation process: the 100 specifications were drawn from a larger pool of real-world-inspired tasks, selected for diversity across domains, complexity tiers, and browser-evaluable features. Workflows and substeps were produced via a hierarchical decomposition protocol with explicit criteria and expert validation. Accuracy is the simple proportion of fully successful workflows; the Pearson r=0.72 is a single pairwise correlation with no multiple-testing correction required. These details and a brief sensitivity note have been added to the revised text. revision: yes
-
Referee: The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.
Authors: The agent's classification policy is fully deterministic and rule-based, producing identical verdicts on repeated runs; the observed human variability therefore reflects task difficulty rather than instability in the automated evaluator. Nevertheless, we have added the requested sensitivity analysis to the revised manuscript. Under both human-majority aggregation and several alternative agent policies, model orderings and the 61.8% headline result remain essentially unchanged, reinforcing that reliable zero-to-one web application development is still a frontier challenge. revision: yes
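A minimal version of the ranking sensitivity check described in this response could look like the sketch below. The model names and scores are invented; the real analysis would use the paper's per-policy accuracy tables for the 16 models.

```python
def model_ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Simple Kendall tau-a between two orderings of the same models.

    Tied pairs count as neither concordant nor discordant.
    """
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    models = list(rank_a)
    concordant = discordant = 0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            a = pos_a[models[i]] - pos_a[models[j]]
            b = pos_b[models[i]] - pos_b[models[j]]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs

# Toy usage with made-up accuracies under two labeling policies.
agent_scores = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.41}
human_majority_scores = {"model_a": 0.58, "model_b": 0.57, "model_c": 0.39}
print(kendall_tau(model_ranking(agent_scores),
                  model_ranking(human_majority_scores)))
```

A tau close to 1 under both human-majority labels and alternative agent policies would support the authors' claim that the model ordering is stable.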
Circularity Check
No significant circularity; purely empirical benchmark
full rationale
The paper introduces a benchmark of 100 web app specifications evaluated via 964 browser workflows on outputs from 16 models. Reported figures (61.8% test accuracy, Pearson r=0.72 for the self-testing correlation) are direct counts and statistical correlations computed from the evaluation pipeline. No equations, fitted parameters renamed as predictions, self-citation chains for uniqueness, or ansatzes appear in the derivation. The human alignment study (31.8-93.6% agreement) is presented as a separate diagnostic rather than a load-bearing input that the main results reduce to by construction. The evaluation chain runs from specification to generation to execution measurement and is self-contained, with no dependence on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autonomous browser workflows provide a sufficient and unbiased measure of application correctness.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Across 16 frontier models, the best achieves 61.8% accuracy on the test split... self-testing during generation as a strong performance predictor (Pearson r=0.72)
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
An automated evaluation pipeline using browser agents to test end-to-end workflows
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering qu...
Reference graph
Works this paper leans on
-
[1]
Anthropic. Claude code. https://claude.com/product/claude-code, 2024. Accessed: 2026-03-03
work page 2024
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint, 2021. URL https://arxiv.org/abs/2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint, 2021. URL https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
Cursor: The AI-first code editor
Cursor. Cursor: The AI-first code editor. https://cursor.sh, 2024
work page 2024
-
[5]
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.16941 2025
-
[6]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint, 2024. URL https://arxiv.org/abs/2403.07974
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint, 2024. doi: 10.48550/arXiv.2310.06770. URL https://arxiv.org/abs/2310.06770
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06770 2024
-
[8]
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint, 2024. URL https://arxiv.org/abs/2401.13649
-
[9]
Lovable. Lovable. https://lovable.dev, 2025. Accessed: 2026-03-02
work page 2025
-
[10]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint, 2025. URL https://arxiv.org/abs/2601.11868
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint, 2024. URL https://arxiv.org/abs/2411.00640
-
[12]
Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint, 2025. URL https://arxiv.org/abs/2502.12115
-
[13]
Browser use: Enable AI to control your browser
Magnus Müller and Gregor Zunić. Browser use: Enable AI to control your browser. https://github.com/browser-use/browser-use, 2024. Accessed: 2026-03-03
work page 2024
-
[14]
Introducing SWE-bench verified
OpenAI. Introducing SWE-bench verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-03-03
work page 2024
-
[15]
OpenAI. Codex. https://developers.openai.com/codex, 2025. Accessed: 2026-03-02
work page 2025
-
[16-17]
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint, 2023. doi: 10.48550/arXiv.2302.06590. URL https://arxiv.org/abs/2302.06590
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.06590
-
[18]
Replit. Replit agent. https://replit.com/products/agent, 2025. Accessed: 2026-03-02
work page 2025
-
[19]
Stack Overflow. Developer survey 2024. https://survey.stackoverflow.co/2024, 2024
work page 2024
- [20]
-
[21]
SWE-Bench. SWE-Bench leaderboard. https://www.swebench.com, 2026. Accessed: 2026-02-27
work page 2026
-
[22]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jasber Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Ziyang Zhang, Botian Jiang, Yongliang Shen, Weiming Lu, Stephanie Lin, Yuqing Du, Wenhu Chen, and Graham Neubig. OpenHands: An open pl...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.01489
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint, 2024. URL https://arxiv.org/abs/2405.15793
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint, 2024. doi: 10.48550/arXiv.2307.13854. URL https://arxiv.org/abs/2307.13854
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.13854 2024
-
[26]
The target directory already exists (e.g., "generated-app"). Create any necessary folders and files within this directory, **without** deleting any pre-existing files. - Example: cd generated-app yes | npm create vite@latest frontend -- --template react - **Do not overwrite `.env`** or any other files that the user has provided. .en...
-
[27]
Keep the project **minimal and lightweight**
-
[28]
Treat **Supabase as the backend** (auth, database, storage, API) using the `@supabase/supabase-js` client (Supabase details below)
-
[29]
Organize frontend code in a flat folder structure under `frontend/src/`: - `src/components/` -> reusable UI components - `src/pages/` -> page components - `src/hooks/` -> custom React hooks - `src/lib/` -> helper functions (e.g., Supabase client)
-
[30]
Use **React Router** for navigation if multiple pages are required
-
[31]
Write clean, composable, and reusable code; avoid unnecessary boilerplate
-
[32]
Follow Vite + React best practices: - Always use **ES modules** (`import`/`export`) - Lazy-load large components via `React.lazy` or dynamic imports - Do not modify Vite's default config unless necessary
-
[33]
Try to use Tailwind CSS v3. When scaffolding shadcn/ui, use `shadcn@2.3.0` for compatibility. Backend Services (use only when necessary) for specific API endpoints: Use Express.js with Node.js for API endpoints that can't be handled by Supabase. Examples of when a backend service is needed: Most likely: third-party API integrations that requ...
-
[34]
**Create Checkout Session** (server-side API route) - Link Stripe customer to Supabase user via metadata - Create payment record in Supabase with 'pending' status - Set success_url and cancel_url pointing to your localhost app - Return session URL to client
-
[35]
Is the docker daemon running?
**Success Page Verification** (CRITICAL!) - User returns with session_id parameter - Server-side: Retrieve session from Stripe API to verify payment status - Update Supabase tables based on verified payment status - Never trust client-side data for payment fulfillment **Implementation:** - Create API routes: `/api/create-checkout` and `/api...
-
[36]
**Initial build (and try before submitting)**: `docker compose -p {project_id} up -d --build`
-
[37]
**After code changes**: `docker compose -p {project_id} restart <service>`
- [38]
-
[39]
**Check the logs to make sure the application is running correctly**: Run `docker compose -p {project_id} logs --tail 200` (or wrap `logs -f` in a short `timeout`) so the command exits on its own
-
[40]
**Test the app through the UI**: You must browse the running application and take screenshots (you MUST save them in `{{ workspace_path }}/.browser_screenshots/`) to verify functionality
-
[41]
**Before submitting**: Run `docker compose -p {project_id} down -v`
-
[42]
**Report**: Report how testing went in your final thoughts before submitting your work. Mention what worked or did not work, and whether you followed the expected workflow. Report precisely any deployment errors or issues you encountered. ## Mandatory Testing - Browse the running application and take screenshots to verify functionality - Fix issues ...
-
[43]
General Implementation - App overview and key features - Tech stack summary - Architecture approach
-
[44]
Code Structure - Frontend/backend directory layout - Key files and their purposes - Database schema Second, a user-friendly USER_README.md at project root in {{ workspace_path }}/generated-app/, at most one page long, with:
-
[45]
Navigation Guide to Test the Application - Authentication (admin or not) method if any - Key URLs and actions - If there is an authentication flow, provide default credentials to be able to test the application without recreating an account. Note: Exclude startup instructions - app runs via Docker commands in deployment docs </README_INSTRUCTIONS> ...
-
[46]
Accounts and Profiles - Members can create accounts with username, email, and password. - Members can set and update display name, bio, and profile image. - Public profile pages show member identity fields and recent posts
-
[47]
Posts - Members can create posts from the home feed. - Post content is limited to 280 characters. - Members can edit and delete only their own posts
-
[48]
Feed and Discovery - Member home feed includes followed-member content and discovery content. - Feed supports sorting by Newest and Trending. - Feed supports search by keyword and hashtag
-
[49]
Social Interactions - Members can follow other members. - Members can like posts. - Members can comment on posts
-
[50]
In-App Notifications - Notifications page shows follow, like, and comment events relevant to the member. Content and Constraints - Text-only posting experience for this version (no media in posts). - All key workflows must be executable in one browser session. Primary User Flows - Onboard and profile setup: sign up, complete profile fields, ...
discussion (0)