
arxiv: 2606.18293 · v2 · pith:7BRWFEEJnew · submitted 2026-06-15 · 💻 cs.SE · cs.AI

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

Pith reviewed 2026-07-02 22:31 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords vibe coding · natural language programming · greenfield software engineering · LLM evaluation · AI coding · programming abstraction

The pith

Vibe coding replaces code syntax with natural language prompts for greenfield software tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates the viability of vibe coding, which uses natural language to build applications without any underlying knowledge of code syntax. It develops an evaluation suite that tests large language models on simple, isolated greenfield Python programming tasks to gain scoped insight into broader software engineering use. A sympathetic reader would care because the approach could mark the final step in the long trend toward higher-level abstractions in programming. The work also examines existing benchmarks for measuring AI performance on these tasks.

Core claim

Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. The paper develops an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

What carries the argument

Evaluation suite for LLM proficiency on simple, isolated greenfield Python programming tasks
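
As a rough illustration of what such a suite involves, the sketch below scores one generated solution against assertion-based tests. It is not the paper's harness: the function name `score_candidate`, the result fields, and the toy `add` task are all hypothetical. `exec` in a fresh namespace is used only for brevity; it is not a sandbox, and a real suite would presumably isolate untrusted code in a subprocess or container.

```python
def score_candidate(task_id, candidate_source, test_source):
    """Score one generated solution (hypothetical helper, not the paper's scorer).

    The candidate source is loaded into a fresh namespace, then the test
    source (plain assert statements) is executed in that same namespace.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # load the model-written code
    except Exception as exc:  # includes SyntaxError
        return {"task_id": task_id, "passed": False,
                "reason": f"candidate failed to load: {type(exc).__name__}"}
    try:
        exec(test_source, namespace)  # run the assertions against it
    except AssertionError:
        return {"task_id": task_id, "passed": False, "reason": "assertion failed"}
    except Exception as exc:
        return {"task_id": task_id, "passed": False,
                "reason": f"test raised {type(exc).__name__}"}
    return {"task_id": task_id, "passed": True, "reason": "ok"}


# Toy task: "write a function add(a, b) that returns the sum" (made up here).
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [
    score_candidate("add-good", good, tests),
    score_candidate("add-bad", bad, tests),
]
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Keeping the verdict and the failure reason together makes it easier to audit disagreements between an automated scorer and a human reviewer later on.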

If this is right

  • Strong results on the suite would indicate that natural language alone can handle isolated greenfield tasks and may scale to larger ones.
  • The suite supplies a concrete method to compare different LLMs and prompts for vibe coding performance.
  • Analysis of existing benchmarks reveals which ones actually test the removal of syntax knowledge.
  • If the approach succeeds here, it supports the historical claim that each new abstraction layer reduces the need for explicit syntax.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vibe coding could change who can create software by removing the requirement to learn syntax first.
  • Education systems might shift focus from syntax mastery to problem description and verification skills.
  • The same evaluation approach could be extended to other languages or to tasks with dependencies between components.

Load-bearing premise

Performance on simple, isolated greenfield programming tasks in Python provides scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering tasks.

What would settle it

An experiment showing that LLMs produce mostly incorrect or incomplete code on the evaluation suite's simple Python tasks when given only natural language descriptions would indicate that vibe coding does not yet work even at this scoped level.
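
One way to make "mostly incorrect" testable is to put an interval around the observed pass rate. The sketch below uses a Wilson score interval; reading "mostly incorrect" as "the upper bound falls below 0.5" is an assumption made here for illustration, not a criterion stated in the paper, and the counts in the usage lines are invented.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Invented counts: 3 passing solutions out of 20 attempts.
low, high = wilson_interval(3, 20)
mostly_incorrect = high < 0.5  # assumed reading of "mostly incorrect"
```

The Wilson interval is chosen over the simpler normal approximation because small suites with pass rates near 0 or 1 are exactly where the normal approximation behaves poorly.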

Figures

Figures reproduced from arXiv: 2606.18293 by Callum Barbour.

Figure 1: Evaluation pipeline for the greenfield benchmarking system.
Figure 2: Manual audit results. These readings showcase several false negatives, particularly concentrated in task 5, while there are no false positives. This suggests that our scoring model is somewhat conservative. It is possible these false negatives were derived from the scorer being uncertain how to handle irrelevant data that may leak into the result.json's variable list. One of the false negatives concerned G…
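
The Figure 2 caption describes an audit that compares the automated scorer's verdicts with manual judgments. A minimal sketch of that comparison is given here, assuming each audited item is recorded as a `(task_id, scorer_passed, human_passed)` tuple. That record layout, the function name, and the sample data are all hypothetical and do not reproduce the paper's results.

```python
from collections import Counter


def audit_disagreements(records):
    """Count scorer/human disagreements per task.

    A false negative is a solution the human accepted but the scorer
    rejected; a false positive is the reverse.
    """
    false_negatives = Counter()
    false_positives = Counter()
    for task_id, scorer_passed, human_passed in records:
        if human_passed and not scorer_passed:
            false_negatives[task_id] += 1
        elif scorer_passed and not human_passed:
            false_positives[task_id] += 1
    return false_negatives, false_positives


# Made-up audit records, shaped like the pattern the caption describes.
records = [
    ("task1", True, True),
    ("task3", False, False),
    ("task5", False, True),
    ("task5", False, True),
    ("task5", True, True),
]
false_negatives, false_positives = audit_disagreements(records)
```

A scorer that only ever errs toward false negatives under-reports model performance, which matches the caption's description of it as conservative.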
Original abstract

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed 'vibe coding'. It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper defines 'vibe coding' as natural-language-only programming that eliminates code syntax, claims this represents the endpoint of high-level abstraction, and presents a new evaluation suite of LLM performance on simple, isolated greenfield Python programming tasks as providing scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering.

Significance. If the evaluation suite were shown to be representative and the results robust, the work could supply early empirical data on natural-language-driven coding capabilities, helping ground discussions of AI-assisted development. The explicit scoping to isolated Python tasks is noted, but the absence of any argument linking narrow-task performance to realistic multi-module or cross-language greenfield work limits the potential impact.

major comments (2)
  1. [Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.
  2. [Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.

    Authors: We agree that the abstract asserts scoped insight without supplying explicit justification or discussion of failure modes. The evaluation was designed as a controlled baseline for isolated natural-language-to-Python tasks, but the manuscript does not articulate why this baseline is informative for broader greenfield work or address integration and cross-language issues. We will revise the introduction to add a limitations subsection that explains the scoping rationale, notes the absence of correlation evidence, and discusses relevant failure modes such as multi-module integration and dependency management. revision: yes

  2. Referee: [Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.

    Authors: The referee is correct that the abstract and introduction contain no methodological details on task design, metrics, models, baselines, or analysis. The current manuscript text is limited to high-level motivation. We will revise both sections to include concise summaries of the evaluation suite (task prompts, correctness metrics via unit tests, selected LLMs, any baselines, and statistical methods), enabling readers to evaluate soundness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with no derivations or self-referential reductions

Full rationale

The paper frames its contribution as the creation and application of an empirical evaluation suite for LLM performance on simple, isolated greenfield Python tasks. No equations, parameter fittings, derivations, uniqueness theorems, or ansatzes are described. The central claim is scoped explicitly to the tasks evaluated and does not reduce to any input by construction, self-citation chain, or renaming of prior results. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that simple isolated tasks can proxy greenfield software engineering; no free parameters, invented entities, or additional axioms are described in the abstract.

axioms (1)
  • domain assumption: Simple, isolated greenfield programming tasks in Python provide scoped insight into vibe coding viability for software engineering.
    Invoked to justify the evaluation suite design and its relevance to the broader claim.

pith-pipeline@v0.9.1-grok · 5702 in / 1205 out tokens · 31575 ms · 2026-07-02T22:31:44.909366+00:00 · methodology

