Recognition: 2 theorem links · Lean Theorem
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Pith reviewed 2026-05-15 15:56 UTC · model grok-4.3
The pith
AI models reach at most 61.8 percent success when building complete web applications from scratch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vibe Code Bench evaluates 16 frontier AI models on generating end-to-end web applications from 100 specifications. The best model achieves 61.8 percent accuracy on the test split when applications are checked by an autonomous browser agent executing 964 workflows with 10,131 substeps. Self-testing during generation correlates strongly with success at Pearson r=0.72, while human alignment studies show evaluator choice shifts pairwise agreement from 31.8 to 93.6 percent.
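As a rough formalization of the two headline statistics (notation ours, not the paper's; the paper may define accuracy over workflows rather than applications, and the correlation may be computed per model or per run):

```latex
% One plausible reading of the reported numbers, under the assumptions above.
% Accuracy of model m over the N evaluated units, where pass(m,i) = 1 iff the
% browser agent judges unit i fully satisfied:
\mathrm{Acc}(m) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{pass}(m, i)

% Pearson correlation between a self-testing measure x and a success measure y
% (the paper reports r = 0.72):
r = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}
        {\sqrt{\sum_k (x_k - \bar{x})^2}\,\sqrt{\sum_k (y_k - \bar{y})^2}}
```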
What carries the argument
The autonomous browser agent, which runs the 964 defined workflows (10,131 substeps in total) against each deployed application to verify that it satisfies the original specification.
Load-bearing premise
The autonomous browser agent and defined workflows fully and accurately determine whether a generated application satisfies the original specification without missing functionality or requiring additional human judgment.
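To make this premise concrete, here is a minimal sketch of the kind of check the pipeline must perform. Every name in it (Substep, Workflow, and the agent's open, act, and judge methods) is hypothetical and not taken from the paper, which does not publish its agent interface in this summary.

```python
from dataclasses import dataclass

@dataclass
class Substep:
    instruction: str   # e.g. "click the 'New Post' button"
    expected: str      # natural-language success criterion

@dataclass
class Workflow:
    name: str
    substeps: list[Substep]

def run_substep(agent, step: Substep) -> bool:
    """Perform one substep with a browser agent and judge the outcome.

    In the benchmark this judgment is made by the autonomous agent itself,
    which is exactly what the load-bearing premise assumes is reliable.
    """
    observation = agent.act(step.instruction)       # hypothetical agent API
    return agent.judge(observation, step.expected)  # hypothetical agent API

def evaluate_app(agent, app_url: str, workflows: list[Workflow]) -> float:
    """Fraction of workflows whose every substep passes for one deployed app."""
    agent.open(app_url)                             # hypothetical agent API
    passed = sum(
        all(run_substep(agent, s) for s in wf.substeps) for wf in workflows
    )
    return passed / len(workflows)
```

If the judging step systematically misses or miscounts functionality, every downstream number, including the 61.8 percent headline, inherits that bias.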
What would settle it
A model or method that achieves over 90 percent accuracy on the held-out test split when evaluated by the same autonomous browser agent and workflows.
read the original abstract
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows and 10,131 substeps. It evaluates 16 frontier models on end-to-end development of deployed applications using an autonomous browser agent, reporting a best-model test accuracy of 61.8%, a Pearson r=0.72 correlation between self-testing during generation and performance, and a human alignment study showing 31.8-93.6% pairwise step-level agreement. Contributions include the dataset/pipeline, model evaluation with cost/latency/error analysis, and the alignment protocol.
Significance. If the evaluation pipeline proves robust, the benchmark would fill a gap by measuring complete zero-to-one web application development rather than isolated tasks, providing concrete evidence that reliable end-to-end generation remains challenging even for frontier models. The self-testing correlation and human alignment results could inform future agent designs and evaluation practices.
major comments (3)
- [Human alignment study and evaluation pipeline] The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality. (A minimal sketch of both agreement statistics follows this list of comments.)
- [Abstract and § on benchmark construction] The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.
- [Evaluation methodology] The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.
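For reference, both statistics named in the first major comment are standard and cheap to compute once step-level labels are tabulated. The sketch below uses invented binary pass/fail labels and a toy panel of three evaluators; it is not the paper's annotation scheme.

```python
from itertools import combinations

def pairwise_agreement(labels_a: list[int], labels_b: list[int]) -> float:
    """Fraction of steps on which two evaluators assign the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def fleiss_kappa(ratings: list[list[int]], n_categories: int) -> float:
    """Fleiss' kappa for ratings[i][r] = category chosen by rater r on step i.

    Assumes every step is labeled by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Per-step category counts.
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # Observed agreement averaged over steps.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[c] for row in counts) / (n_items * n_raters)
           for c in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Toy usage: 3 evaluators labeling 4 substeps as pass (1) / fail (0).
ratings = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 0, 0]]
print(fleiss_kappa(ratings, n_categories=2))
for a, b in combinations(range(3), 2):
    col_a = [row[a] for row in ratings]
    col_b = [row[b] for row in ratings]
    print(a, b, pairwise_agreement(col_a, col_b))
```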
minor comments (2)
- [Abstract] The abstract refers to a 'completed human alignment study', but the main text should explicitly state the number of human annotators, their expertise, and the exact protocol for step-level labeling.
- [Results section] Cost and latency results are mentioned but not tied to specific model identifiers or hardware; a table linking these metrics to the 16 models would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on Vibe Code Bench. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.
read point-by-point responses
-
Referee: The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality.
Authors: We agree that the alignment study reveals meaningful variability in human judgments, which we included precisely to illustrate the difficulty of evaluating complex, deployed web applications. In the revised manuscript we have added: (1) a case-by-case breakdown of the disagreement instances and their distribution across models, (2) the effect of those disagreements on final model rankings, and (3) Fleiss' kappa computed over the full set of step-level annotations. We also performed the requested sensitivity analysis using human-majority labels and alternative agent thresholds; model rankings remain stable, supporting the robustness of the reported 61.8% figure and the conclusion that end-to-end development remains challenging. revision: yes
-
Referee: The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.
Authors: We have substantially expanded both the abstract and Section 3 (Benchmark Construction) to describe the curation process: the 100 specifications were drawn from a larger pool of real-world-inspired tasks, selected for diversity across domains, complexity tiers, and browser-evaluable features. Workflows and substeps were produced via a hierarchical decomposition protocol with explicit criteria and expert validation. Accuracy is the simple proportion of fully successful workflows; the Pearson r=0.72 is a single pairwise correlation with no multiple-testing correction required. These details and a brief sensitivity note have been added to the revised text. revision: yes
-
Referee: The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.
Authors: The agent's classification policy is fully deterministic and rule-based, producing identical verdicts on repeated runs; the observed human variability therefore reflects task difficulty rather than instability in the automated evaluator. Nevertheless, we have added the requested sensitivity analysis to the revised manuscript. Under both human-majority aggregation and several alternative agent policies, model orderings and the 61.8% headline result remain essentially unchanged, reinforcing that reliable zero-to-one web application development is still a frontier challenge. revision: yes
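A minimal version of the ranking sensitivity check described in this response could look like the sketch below. The model names and scores are invented; the real analysis would use the paper's per-policy accuracy tables for the 16 models.

```python
def model_ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(rank_a: list[str], rank_b: list[str]) -> float:
    """Simple Kendall tau-a between two orderings of the same models.

    Tied pairs count as neither concordant nor discordant.
    """
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    models = list(rank_a)
    concordant = discordant = 0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            a = pos_a[models[i]] - pos_a[models[j]]
            b = pos_b[models[i]] - pos_b[models[j]]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
    n_pairs = len(models) * (len(models) - 1) / 2
    return (concordant - discordant) / n_pairs

# Toy usage with made-up accuracies under two labeling policies.
agent_scores = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.41}
human_majority_scores = {"model_a": 0.58, "model_b": 0.57, "model_c": 0.39}
print(kendall_tau(model_ranking(agent_scores),
                  model_ranking(human_majority_scores)))
```

A tau close to 1 under both human-majority labels and alternative agent policies would support the authors' claim that the model ordering is stable.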
Circularity Check
No significant circularity; purely empirical benchmark
full rationale
The paper introduces a benchmark of 100 web app specifications evaluated via 964 browser workflows on outputs from 16 models. Reported figures (61.8% test accuracy, Pearson r=0.72 for the self-testing correlation) are direct counts and statistical correlations computed from the evaluation pipeline. No equations, fitted parameters renamed as predictions, self-citation chains for uniqueness, or ansatzes appear in the derivation. The human alignment study (31.8-93.6% agreement) is presented as a separate diagnostic rather than a load-bearing input that the main results reduce to by construction. The evaluation chain runs from specification to generation to execution measurement and is self-contained, with no dependence on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autonomous browser workflows provide a sufficient and unbiased measure of application correctness.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Across 16 frontier models, the best achieves 61.8% accuracy on the test split... self-testing during generation as a strong performance predictor (Pearson r=0.72)
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
An automated evaluation pipeline using browser agents to test end-to-end workflows
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering qu...
Reference graph
Works this paper leans on
-
[1]
Anthropic. Claude code. https://claude.com/product/claude-code, 2024. Accessed: 2026-03-03
work page 2024
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint, 2021. URL https://arxiv.org/abs/2108.07732
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint, 2021. URL https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
Cursor: The AI-first code editor
Cursor. Cursor: The AI-first code editor. https://cursor.sh, 2024
work page 2024
-
[5]
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.16941 2025
-
[6]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint, 2024. URL https://arxiv.org/abs/2403.07974
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint, 2024. doi: 10.48550/arXiv.2310.06770. URL https://arxiv.org/abs/2310.06770
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06770 2024
-
[8]
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint, 2024. URL https://arxiv.org/abs/2401.13649
-
[9]
Lovable. Lovable. https://lovable.dev, 2025. Accessed: 2026-03-02
work page 2025
-
[10]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint, 2025. URL https://arxiv.org/abs/2601.11868
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint, 2024. URL https://arxiv.org/abs/2411.00640
-
[12]
Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint, 2025. URL https://arxiv.org/abs/2502.12115
-
[13]
Browser use: Enable AI to control your browser
Magnus Müller and Gregor Zunić. Browser use: Enable AI to control your browser. https://github.com/browser-use/browser-use, 2024. Accessed: 2026-03-03
work page 2024
-
[14]
Introducing SWE-bench verified
OpenAI. Introducing SWE-bench verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-03-03
work page 2024
-
[15]
OpenAI. Codex. https://developers.openai.com/codex, 2025. Accessed: 2026-03-02
work page 2025
-
[16-17]
The Impact of AI on Developer Productivity: Evidence from GitHub Copilot
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint, 2023. doi: 10.48550/arXiv.2302.06590. URL https://arxiv.org/abs/2302.06590
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.06590
-
[18]
Replit. Replit agent. https://replit.com/products/agent, 2025. Accessed: 2026-03-02
work page 2025
-
[19]
Stack Overflow. Developer survey 2024. https://survey.stackoverflow.co/2024, 2024
work page 2024
- [20]
-
[21]
SWE-Bench. SWE-Bench leaderboard. https://www.swebench.com, 2026. Accessed: 2026-02-27
work page 2026
-
[22]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jasber Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Ziyang Zhang, Botian Jiang, Yongliang Shen, Weiming Lu, Stephanie Lin, Yuqing Du, Wenhu Chen, and Graham Neubig. OpenHands: An open pl...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.01489
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint, 2024. URL https://arxiv.org/abs/2405.15793
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint, 2024. doi: 10.48550/arXiv.2307.13854. URL https://arxiv.org/abs/2307.13854
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.13854 2024
-
[26]
The target directory already exists (e.g., "generated-app"). Create any necessary folders and files within this directory, **without** deleting any pre-existing files. - Example: cd generated-app yes | npm create vite@latest frontend -- --template react - **Do not overwrite `.env`** or any other files that the user has provided. .en...
-
[27]
Keep the project **minimal and lightweight**
-
[28]
Treat **Supabase as the backend** (auth, database, storage, API) using the `@supabase/supabase-js` client (Supabase details below)
-
[29]
Organize frontend code in a flat folder structure under `frontend/src/`: - `src/components/` -> reusable UI components - `src/pages/` -> page components - `src/hooks/` -> custom React hooks - `src/lib/` -> helper functions (e.g., Supabase client)
-
[30]
Use **React Router** for navigation if multiple pages are required
-
[31]
Write clean, composable, and reusable code; avoid unnecessary boilerplate
-
[32]
Follow Vite + React best practices: - Always use **ES modules** (`import`/`export`) - Lazy-load large components via `React.lazy` or dynamic imports - Do not modify Vite's default config unless necessary
-
[33]
Try to use Tailwind CSS v3. When scaffolding shadcn/ui, use `shadcn@2.3.0` for compatibility. Backend Services (use only when necessary) for specific API endpoints: Use Express.js with Node.js for API endpoints that can't be handled by Supabase. Examples of when a backend service is needed: Most likely: third-party API integrations that requ...
-
[34]
**Create Checkout Session** (server-side API route) - Link Stripe customer to Supabase user via metadata - Create payment record in Supabase with 'pending' status - Set success_url and cancel_url pointing to your localhost app - Return session URL to client
-
[35]
Is the docker daemon running?
**Success Page Verification** (CRITICAL!) - User returns with session_id parameter - Server-side: Retrieve session from Stripe API to verify payment status - Update Supabase tables based on verified payment status - Never trust client-side data for payment fulfillment **Implementation:** - Create API routes: `/api/create-checkout` and `/api...
-
[36]
**Initial build (and try before submitting)**: `docker compose -p {project_id} up -d --build`
-
[37]
**After code changes**: `docker compose -p {project_id} restart <service>`
- [38]
-
[39]
**Check the logs to make sure the application is running correctly**: Run `docker compose -p {project_id} logs --tail 200` (or wrap `logs -f` in a short `timeout`) so the command exits on its own
-
[40]
**Test the app through the UI**: You must browse the running application and take screenshots (you MUST save them in `{{ workspace_path }}/.browser_screenshots/`) to verify functionality
-
[41]
**Before submitting**: Run `docker compose -p {project_id} down -v`
-
[42]
**Report**: Report how testing went in your final thoughts before submitting your work. Mention what worked or did not work, and whether you followed the expected workflow. Report precisely any deployment errors or issues you encountered. ## Mandatory Testing - Browse the running application and take screenshots to verify functionality - Fix issues ...
-
[43]
General Implementation - App overview and key features - Tech stack summary - Architecture approach
-
[44]
Code Structure - Frontend/backend directory layout - Key files and their purposes - Database schema Second, a user-friendly USER_README.md at project root in {{ workspace_path }}/generated-app/, at most one page long, with:
-
[45]
Navigation Guide to Test the Application - Authentication (admin or not) method if any - Key URLs and actions - If there is an authentication flow, provide default credentials to be able to test the application without recreating an account. Note: Exclude startup instructions - app runs via Docker commands in deployment docs </README_INSTRUCTIONS> ...
-
[46]
Accounts and Profiles - Members can create accounts with username, email, and password. - Members can set and update display name, bio, and profile image. - Public profile pages show member identity fields and recent posts
-
[47]
Posts - Members can create posts from the home feed. - Post content is limited to 280 characters. - Members can edit and delete only their own posts
-
[48]
Feed and Discovery - Member home feed includes followed-member content and discovery content. - Feed supports sorting by Newest and Trending. - Feed supports search by keyword and hashtag
-
[49]
Social Interactions - Members can follow other members. - Members can like posts. - Members can comment on posts
-
[50]
In-App Notifications - Notifications page shows follow, like, and comment events relevant to the member. Content and Constraints - Text-only posting experience for this version (no media in posts). - All key workflows must be executable in one browser session. Primary User Flows - Onboard and profile setup: sign up, complete profile fields, ...
discussion (0)