Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming
Pith reviewed 2026-07-02 22:31 UTC · model grok-4.3
The pith
Vibe coding replaces code syntax with natural language prompts for greenfield software tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. The paper develops an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.
What carries the argument
Evaluation suite for LLM proficiency on simple, isolated greenfield Python programming tasks
If this is right
- Strong results on the suite would indicate that natural language alone can handle isolated greenfield tasks and may scale to larger ones.
- The suite supplies a concrete method to compare different LLMs and prompts for vibe coding performance.
- Analysis of existing benchmarks reveals which ones actually test the removal of syntax knowledge.
- If the approach succeeds here, it supports the historical claim that each new abstraction layer reduces the need for explicit syntax.
Where Pith is reading between the lines
- Vibe coding could change who can create software by removing the requirement to learn syntax first.
- Education systems might shift focus from syntax mastery to problem description and verification skills.
- The same evaluation approach could be extended to other languages or to tasks with dependencies between components.
Load-bearing premise
Performance on simple, isolated greenfield programming tasks in Python provides scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering tasks.
What would settle it
An experiment showing that LLMs produce mostly incorrect or incomplete code on the evaluation suite's simple Python tasks when given only natural language descriptions would indicate that vibe coding does not yet work even at this scoped level.
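As a rough illustration of what such an experiment could look like, the sketch below scores generated code against a hidden check for each task and reports the fraction that pass. It is not the paper's suite: the `Task` class, the two example tasks, and the hard-coded `SOLUTIONS` strings (standing in for LLM output produced from the prompt alone) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task: a plain-language prompt plus a hidden correctness check."""
    prompt: str                     # natural-language description only, no code
    check: Callable[[dict], None]   # raises AssertionError if the solution is wrong


def passes(task: Task, generated_source: str) -> bool:
    """Execute generated code in a fresh namespace, then apply the task's check."""
    namespace: dict = {}
    try:
        # Assumption: the source is trusted here. Real LLM output should be
        # run in a sandboxed subprocess with a timeout, not via bare exec().
        exec(generated_source, namespace)
        task.check(namespace)
        return True
    except Exception:
        # Syntax errors, missing functions, and failed assertions all count as failures.
        return False


def pass_rate(tasks: List[Task], solutions: List[str]) -> float:
    """Fraction of tasks whose generated solution passes its check."""
    if not tasks:
        raise ValueError("at least one task is required")
    results = [passes(t, s) for t, s in zip(tasks, solutions)]
    return sum(results) / len(results)


def check_reverse_words(ns: dict) -> None:
    assert ns["reverse_words"]("a b c") == "c b a"


def check_is_even(ns: dict) -> None:
    assert ns["is_even"](4) is True
    assert ns["is_even"](3) is False


TASKS = [
    Task("Write reverse_words(s) that reverses the order of words in s.", check_reverse_words),
    Task("Write is_even(n) that returns True when n is even.", check_is_even),
]

# Stand-ins for model output: the first is correct, the second has an inverted test.
SOLUTIONS = [
    "def reverse_words(s):\n    return ' '.join(reversed(s.split()))\n",
    "def is_even(n):\n    return n % 2 == 1\n",
]

print(f"pass rate: {pass_rate(TASKS, SOLUTIONS):.2f}")
```

A pass rate well below one half on tasks this simple would be the kind of result that counts against vibe coding at the scoped level; where exactly to draw that line is a judgment the source does not specify.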
Original abstract
Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed 'vibe coding'. It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines 'vibe coding' as natural-language-only programming that eliminates code syntax, claims this represents the endpoint of high-level abstraction, and presents a new evaluation suite of LLM performance on simple, isolated greenfield Python programming tasks as providing scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering.
Significance. If the evaluation suite were shown to be representative and the results robust, the work could supply early empirical data on natural-language-driven coding capabilities, helping ground discussions of AI-assisted development. The explicit scoping to isolated Python tasks is noted, but the absence of any argument linking narrow-task performance to realistic multi-module or cross-language greenfield work limits the potential impact.
major comments (2)
- [Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.
- [Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
- Referee: [Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.
Authors: We agree that the abstract asserts scoped insight without supplying explicit justification or discussion of failure modes. The evaluation was designed as a controlled baseline for isolated natural-language-to-Python tasks, but the manuscript does not articulate why this baseline is informative for broader greenfield work or address integration and cross-language issues. We will revise the introduction to add a limitations subsection that explains the scoping rationale, notes the absence of correlation evidence, and discusses relevant failure modes such as multi-module integration and dependency management.
Revision: yes
- Referee: [Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.
Authors: The referee is correct that the abstract and introduction contain no methodological details on task design, metrics, models, baselines, or analysis. The current manuscript text is limited to high-level motivation. We will revise both sections to include concise summaries of the evaluation suite (task prompts, correctness metrics via unit tests, selected LLMs, any baselines, and statistical methods), enabling readers to evaluate soundness.
Revision: yes
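The rebuttal does not say which statistical methods would be reported. As one plausible option, and purely as an assumption on our part, a unit-test pass rate can be accompanied by a Wilson score interval so that results from small task sets are not over-read. The function name `wilson_interval` and the example counts below are illustrative, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z = 1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must lie between 0 and total")
    p = passed / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: a model passes 8 of 10 tasks.
low, high = wilson_interval(8, 10)
print(f"pass rate 0.80, interval [{low:.3f}, {high:.3f}]")
```

With only ten tasks the interval is wide, which illustrates why the referee's request for task counts and statistical analysis matters when comparing models.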
Circularity Check
No circularity: empirical evaluation with no derivations or self-referential reductions
The paper frames its contribution as the creation and application of an empirical evaluation suite for LLM performance on simple, isolated greenfield Python tasks. No equations, parameter fittings, derivations, uniqueness theorems, or ansatzes are described. The central claim is scoped explicitly to the tasks evaluated and does not reduce to any input by construction, self-citation chain, or renaming of prior results. This is a standard non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Simple, isolated greenfield programming tasks in Python provide scoped insight into vibe coding viability for software engineering.