pith. machine review for the scientific record.

arxiv: 2603.04601 · v3 · submitted 2026-03-04 · 💻 cs.SE · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:56 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords AI code generation · web application development · benchmark · end-to-end evaluation · frontier models · self-testing · browser agent · application correctness
0 comments

The pith

AI models reach at most 61.8 percent success when building complete web applications from scratch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Vibe Code Bench to measure how well AI models can create working web applications from full specifications instead of isolated code snippets. It runs 16 frontier models through 100 specifications and checks the resulting deployed apps with an autonomous browser agent that executes 964 workflows. The top model passes 61.8 percent of the held-out test cases, and models that test their own output during generation score higher with a 0.72 correlation. The work also measures how much different evaluators disagree on whether an app meets its spec.
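To make that pipeline concrete, the sketch below mirrors the loop described above: a model builds an app from a specification, the app is deployed, and a browser agent walks each workflow's substeps. Every name here (generate_app, deploy, run_substep) is a hypothetical placeholder rather than the paper's actual code, and the all-substeps-must-pass scoring rule is an assumption about how workflow success is tallied.

```python
# Minimal sketch of the benchmark loop, with placeholder callables.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Workflow:
    description: str
    substeps: list[str]            # browser actions the agent must complete

@dataclass
class Spec:
    text: str                      # full natural-language specification
    workflows: list[Workflow]      # held-out checks for this app

def evaluate_model(
    generate_app: Callable[[str], str],      # spec text -> path of generated app
    deploy: Callable[[str], str],            # app path -> running URL
    run_substep: Callable[[str, str], bool], # (url, substep) -> agent's verdict
    specs: list[Spec],
) -> float:
    """Fraction of workflows that the generated, deployed apps pass."""
    passed = total = 0
    for spec in specs:
        url = deploy(generate_app(spec.text))
        for wf in spec.workflows:
            total += 1
            # assumed scoring rule: a workflow passes only if every substep succeeds
            if all(run_substep(url, step) for step in wf.substeps):
                passed += 1
    return passed / total
```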

Core claim

Vibe Code Bench evaluates 16 frontier AI models on generating end-to-end web applications from 100 specifications. The best model achieves 61.8 percent accuracy on the test split when applications are checked by an autonomous browser agent executing 964 workflows with 10,131 substeps. Self-testing during generation correlates strongly with success at Pearson r=0.72, while human alignment studies show evaluator choice shifts pairwise agreement from 31.8 to 93.6 percent.
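As a reading aid, here is how the two headline statistics could be computed once the agent's verdicts are in hand. The counts and per-model values below are invented for illustration, and whether the paper tallies accuracy per workflow or per substep is not stated in this summary.

```python
import statistics

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation, the statistic behind the self-testing result."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# accuracy on the held-out split: passing checks divided by total checks
passed, total = 309, 500                      # hypothetical counts
accuracy = passed / total

# hypothetical per-model values: how often a model tested its own app during
# generation, against the fraction of checks its apps later passed
self_test_rate = [0.9, 0.7, 0.4, 0.2, 0.8]
pass_rate      = [0.62, 0.55, 0.35, 0.28, 0.58]
print(round(accuracy, 3), round(pearson_r(self_test_rate, pass_rate), 2))
```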

What carries the argument

The autonomous browser agent that runs the 964 defined workflows, comprising 10,131 substeps, against the deployed applications to verify that they satisfy the original specifications.

Load-bearing premise

The autonomous browser agent and defined workflows fully and accurately determine whether a generated application satisfies the original specification without missing functionality or requiring additional human judgment.
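This premise is easiest to see as an aggregation rule over per-substep verdicts. The sketch below is not from the paper; the UNVERIFIABLE case and the strict/lenient switch are hypothetical, and exist only to show how much the verdict-aggregation policy the premise assumes away can move a workflow's pass/fail outcome.

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNVERIFIABLE = "unverifiable"  # hypothetical: agent cannot decide from the UI alone

def workflow_passes(substep_verdicts: list[Verdict], strict: bool = True) -> bool:
    """Aggregate per-substep verdicts into a workflow-level pass/fail.

    strict=True treats unverifiable substeps as failures; strict=False ignores
    them. The paper does not describe such a policy; the two modes only
    illustrate how sensitive the outcome is to the aggregation rule.
    """
    if strict:
        return all(v is Verdict.PASS for v in substep_verdicts)
    return all(v is not Verdict.FAIL for v in substep_verdicts)
```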

What would settle it

A model or method that achieves over 90 percent accuracy on the held-out test split when evaluated by the same autonomous browser agent and workflows.

Figures

Figures reproduced from arXiv: 2603.04601 by Alex Gu, Antoine Bigeard, Hung Tran, Langston Nashold, Rayan Krishnan.

Figure 1. Generation flow from natural-language specification to a runnable application artifact.
Figure 2. Automated evaluation flow from deployed app to workflow pass/fail scoring.
Figure 3. Accuracy–cost and accuracy–latency trade-offs.
Figure 4. Performance by reasoning effort (20-app subset).
Figure 5. Vision ablation (20-app subset). Disabling vision reduces accuracy for GPT-5.3-Codex and Gemini 3.1 Pro, but has little effect on Opus 4.6 Thinking in this run.
Figure 6. Aggregate trajectory action composition across all tasks.
Figure 7. Application pass-rate histograms for all evaluated models (test split).
Figure 8. Trajectory timeline by model on a single application.
read the original abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows and 10,131 substeps. It evaluates 16 frontier models on end-to-end development of deployed applications using an autonomous browser agent, reporting a best-model test accuracy of 61.8%, a Pearson r=0.72 correlation between self-testing during generation and performance, and a human alignment study showing 31.8-93.6% pairwise step-level agreement. Contributions include the dataset/pipeline, model evaluation with cost/latency/error analysis, and the alignment protocol.
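For readers unfamiliar with the agreement statistics at issue, the sketch below shows how pairwise step-level agreement and Fleiss' kappa (requested in major comment 1 below) could be computed from evaluator labels. The labels are invented; they are not data from the paper's alignment study.

```python
def pairwise_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of substeps on which two evaluators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def fleiss_kappa(counts: list[dict[str, int]]) -> float:
    """counts[i][cat] = number of raters assigning category cat to item i.
    Assumes the same number of raters n for every item."""
    n = sum(counts[0].values())
    cats = {c for row in counts for c in row}
    N = len(counts)
    p_j = {c: sum(row.get(c, 0) for row in counts) / (N * n) for c in cats}
    P_i = [(sum(v * v for v in row.values()) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j.values())
    return (P_bar - P_e) / (1 - P_e)

# two hypothetical evaluators labelling five substeps
ev1 = ["pass", "pass", "fail", "pass", "fail"]
ev2 = ["pass", "fail", "fail", "pass", "pass"]
print(pairwise_agreement(ev1, ev2))            # 0.6

# three hypothetical raters, pass/fail counts per substep
rows = [{"pass": 3}, {"pass": 2, "fail": 1}, {"fail": 3}, {"pass": 1, "fail": 2}]
print(round(fleiss_kappa(rows), 3))
```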

Significance. If the evaluation pipeline proves robust, the benchmark would fill a gap by measuring complete zero-to-one web application development rather than isolated tasks, providing concrete evidence that reliable end-to-end generation remains challenging even for frontier models. The self-testing correlation and human alignment results could inform future agent designs and evaluation practices.

major comments (3)
  1. [Human alignment study and evaluation pipeline] The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality.
  2. [Abstract and § on benchmark construction] The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.
  3. [Evaluation methodology] The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.
minor comments (2)
  1. [Abstract] The abstract states 'completed human alignment study' but the main text should explicitly state the number of human annotators, their expertise, and the exact protocol for step-level labeling.
  2. [Results section] Cost and latency results are mentioned but not tied to specific model identifiers or hardware; a table linking these metrics to the 16 models would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on Vibe Code Bench. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.

read point-by-point responses
  1. Referee: The central performance claims (61.8% test accuracy and r=0.72 self-testing correlation) rest on the autonomous browser agent's classifications of whether generated applications satisfy the original specifications. The human alignment study reports only 31.8-93.6% pairwise step-level agreement, indicating that different evaluators reach materially different verdicts; without a detailed breakdown of disagreement cases, their effect on model rankings, or an inter-rater reliability statistic (e.g., Fleiss' kappa), it is unclear whether the agent's policy systematically under- or over-counts missing functionality.

    Authors: We agree that the alignment study reveals meaningful variability in human judgments, which we included precisely to illustrate the difficulty of evaluating complex, deployed web applications. In the revised manuscript we have added: (1) a case-by-case breakdown of the disagreement instances and their distribution across models, (2) the effect of those disagreements on final model rankings, and (3) Fleiss' kappa computed over the full set of step-level annotations. We also performed the requested sensitivity analysis using human-majority labels and alternative agent thresholds; model rankings remain stable, supporting the robustness of the reported 61.8% figure and the conclusion that end-to-end development remains challenging. revision: yes

  2. Referee: The abstract and methods description provide no information on how the 100 specifications were chosen, the criteria used to define the 964 workflows and 10,131 substeps, or the statistical procedures (including any multiple-comparison corrections) underlying the accuracy figures and Pearson correlation. These omissions make it impossible to assess whether the reported numbers are sensitive to arbitrary choices in benchmark construction.

    Authors: We have substantially expanded both the abstract and Section 3 (Benchmark Construction) to describe the curation process: the 100 specifications were drawn from a larger pool of real-world-inspired tasks, selected for diversity across domains, complexity tiers, and browser-evaluable features. Workflows and substeps were produced via a hierarchical decomposition protocol with explicit criteria and expert validation. Accuracy is the simple proportion of fully successful workflows; the Pearson r=0.72 is a single pairwise correlation with no multiple-testing correction required. These details and a brief sensitivity note have been added to the revised text. revision: yes

  3. Referee: The paper claims the browser agent determines spec compliance without additional human judgment, yet the low pairwise agreement directly contradicts the assumption that the agent's verdicts are stable and objective. A sensitivity analysis showing how model rankings change under alternative agent policies or human-majority labels is required to support the 'frontier challenge' conclusion.

    Authors: The agent's classification policy is fully deterministic and rule-based, producing identical verdicts on repeated runs; the observed human variability therefore reflects task difficulty rather than instability in the automated evaluator. Nevertheless, we have added the requested sensitivity analysis to the revised manuscript. Under both human-majority aggregation and several alternative agent policies, model orderings and the 61.8% headline result remain essentially unchanged, reinforcing that reliable zero-to-one web application development is still a frontier challenge. revision: yes
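Below is a minimal sketch of the kind of ranking-stability check described in this response: re-score each model under an alternative labelling policy and compare the two orderings with a rank correlation. The scores are invented and the helper names are hypothetical; only the mechanics are shown.

```python
def ranking(scores: dict[str, float]) -> list[str]:
    """Models sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def kendall_tau(r1: list[str], r2: list[str]) -> float:
    """Kendall rank correlation between two orderings of the same models."""
    pos1 = {m: i for i, m in enumerate(r1)}
    pos2 = {m: i for i, m in enumerate(r2)}
    models = list(pos1)
    concordant = discordant = 0
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            a, b = models[i], models[j]
            if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0:
                concordant += 1
            else:
                discordant += 1
    n = len(models)
    return (concordant - discordant) / (n * (n - 1) / 2)

# hypothetical pass rates under the agent's labels vs. human-majority labels
agent_scores    = {"model_a": 0.618, "model_b": 0.55, "model_c": 0.41}
majority_scores = {"model_a": 0.60,  "model_b": 0.52, "model_c": 0.44}
print(kendall_tau(ranking(agent_scores), ranking(majority_scores)))  # 1.0 if the order is unchanged
```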

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

full rationale

The paper introduces a benchmark of 100 web app specifications evaluated via 964 browser workflows on outputs from 16 models. The reported figures (61.8% test accuracy, Pearson r=0.72 for the self-testing correlation) are direct counts and correlations computed from the evaluation pipeline. No equations, fitted parameters renamed as predictions, self-citation chains, or ansatzes appear in the derivation. The human alignment study (31.8-93.6% agreement) is presented as a separate diagnostic rather than a load-bearing input that the main results reduce to by construction. The measurement chain runs from specification to generation to execution and does not depend on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of browser-agent evaluation and the representativeness of the 100 specifications; no free parameters are introduced and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: Autonomous browser workflows provide a sufficient and unbiased measure of application correctness.
    This assumption directly supports the reported accuracy percentages and is invoked in the evaluation pipeline description.

pith-pipeline@v0.9.0 · 5515 in / 1305 out tokens · 58553 ms · 2026-05-15T15:56:28.161364+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

    cs.MA · 2026-05 · conditional novelty 7.0

    SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering qu...

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Claude code

    Anthropic. Claude code. https://claude.com/product/claude-code, 2024. Accessed: 2026-03-03

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint, 2021. URL https://arxiv.org/abs/2108.07732

  3. [3]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint, 2021. URL https://arxiv.org/abs/2107.03374

  4. [4]

    Cursor: The AI-first code editor

    Cursor. Cursor: The AI-first code editor. https://cursor.sh, 2024

  5. [5]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...

  6. [6]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint, 2024. URL https://arxiv.org/abs/2403.07974

  7. [7]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint, 2024. doi: 10.48550/arXiv.2310.06770. URL https://arxiv.org/abs/2310.06770

  8. [8]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint, 2024. URL https://arxiv.org/abs/2401.13649

  9. [9]

    Lovable. Lovable. https://lovable.dev, 2025. Accessed: 2026-03-02

  10. [10]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint, 2025. URL https://arxiv.org/abs/2601.11868

  11. [11]

    Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

    Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint, 2024. URL https://arxiv.org/abs/2411.00640

  12. [12]

    SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

    Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint, 2025. URL https://arxiv.org/abs/2502.12115

  13. [13]

    Browser use: Enable AI to control your browser

    Magnus Müller and Gregor Zunić. Browser use: Enable AI to control your browser. https://github.com/browser-use/browser-use, 2024. Accessed: 2026-03-03

  14. [14]

    Introducing SWE-bench verified

    OpenAI. Introducing SWE-bench verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-03-03

  15. [15]

    OpenAI. Codex. https://developers.openai.com/codex, 2025. Accessed: 2026-03-02

  16. [16]

    The impact of AI on developer productivity: Evidence from GitHub Copilot.arXiv preprint,

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint.

  17. [17]

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

    doi: 10.48550/arXiv.2302.06590. URL https://arxiv.org/abs/2302.06590

  18. [18]

    Replit agent

    Replit. Replit agent. https://replit.com/products/agent, 2025. Accessed: 2026-03-02

  19. [19]

    Developer survey 2024

    Stack Overflow. Developer survey 2024. https://survey.stackoverflow.co/2024, 2024

  20. [20]

    Supabase

    Supabase. Supabase. https://supabase.com, 2024. Accessed: 2026-02-27

  21. [21]

    SWE-Bench leaderboard

    SWE-Bench. SWE-Bench leaderboard. https://www.swebench.com, 2026. Accessed: 2026-02-27

  22. [22]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jasber Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Ziyang Zhang, Botian Jiang, Yongliang Shen, Weiming Lu, Stephanie Lin, Yuqing Du, Wenhu Chen, and Graham Neubig. OpenHands: An open pl...

  23. [23]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.01489

  24. [24]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint, 2024. URL https://arxiv.org/abs/2405.15793

  25. [25]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint, 2024. doi: 10.48550/arXiv.2307.13854. URL https://arxiv.org/abs/2307.13854
