Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

Callum Barbour

arXiv: 2606.18293 · v2 · pith:7BRWFEEJnew · submitted 2026-06-15 · cs.SE · cs.AI

Pith reviewed 2026-07-02 22:31 UTC · model grok-4.3

keywords: vibe coding · natural language programming · greenfield software engineering · LLM evaluation · AI coding · programming abstraction

The pith

Vibe coding replaces code syntax with natural language prompts for greenfield software tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates the viability of vibe coding, which uses natural language to build applications without any underlying knowledge of code syntax. It develops an evaluation suite that tests large language models on simple, isolated greenfield Python programming tasks to gain scoped insight into broader software engineering use. A sympathetic reader would care because the approach could mark the final step in the long trend toward higher-level abstractions in programming. The work also examines existing benchmarks for measuring AI performance on these tasks.

Core claim

Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. The paper develops an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

What carries the argument

Evaluation suite for LLM proficiency on simple, isolated greenfield Python programming tasks

If this is right

Strong results on the suite would indicate that natural language alone can handle isolated greenfield tasks and may scale to larger ones.
The suite supplies a concrete method to compare different LLMs and prompts for vibe coding performance.
Analysis of existing benchmarks reveals which ones actually test the removal of syntax knowledge.
If the approach succeeds here, it supports the historical claim that each new abstraction layer reduces the need for explicit syntax.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Vibe coding could change who can create software by removing the requirement to learn syntax first.
Education systems might shift focus from syntax mastery to problem description and verification skills.
The same evaluation approach could be extended to other languages or to tasks with dependencies between components.

Load-bearing premise

Performance on simple, isolated greenfield programming tasks in Python provides scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering tasks.

What would settle it

An experiment showing that LLMs produce mostly incorrect or incomplete code on the evaluation suite's simple Python tasks when given only natural language descriptions would indicate that vibe coding does not yet work even at this scoped level.
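A minimal sketch of how such an experiment could be scored, assuming each task ships a list of input/output checks and each model response is Python source that defines a function named `solve`. The entry-point name, the task layout, and the helper names `run_candidate` and `pass_rate` are hypothetical illustrations, not the paper's actual suite.

```python
from typing import Dict, List, Tuple

# A check is (positional arguments, expected return value). This layout is an assumption.
Check = Tuple[tuple, object]


def run_candidate(source: str, checks: List[Check]) -> bool:
    """Return True only if the candidate source defines `solve` and passes every check."""
    namespace: dict = {}
    try:
        # Caution: exec runs arbitrary code. A real harness would sandbox this step.
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in checks)
    except Exception:
        # Syntax errors, a missing entry point, and runtime failures all count as incorrect.
        return False


def pass_rate(candidates: Dict[str, str], tasks: Dict[str, List[Check]]) -> float:
    """Fraction of tasks whose candidate passes all of that task's checks."""
    if not tasks:
        return 0.0
    passed = sum(
        run_candidate(candidates.get(name, ""), checks)
        for name, checks in tasks.items()
    )
    return passed / len(tasks)
```

Under these assumptions, a pass rate close to zero on the simple tasks would be the outcome described above, while a high pass rate would say nothing yet about multi-file or dependency-heavy work.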

Figures

Figures reproduced from arXiv: 2606.18293 by Callum Barbour.

**Figure 2.** Manual audit results. These readings showcase several false negatives, particularly concentrated in task 5, while there are no false positives. This suggests that our scoring model is somewhat conservative. It is possible these false negatives were derived from the scorer being uncertain how to handle irrelevant data that may leak into the result.json’s variable list. One of the false negatives concerned G…
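To make the audit terminology concrete, the sketch below tallies agreement between an automated scorer and a manual reviewer. The data layout (pairs of booleans) and the function name `audit_counts` are assumptions for illustration; the paper's actual audit procedure is not described on this page.

```python
from collections import Counter
from typing import Iterable, Tuple


def audit_counts(pairs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Tally outcomes for (scorer_passed, reviewer_passed) pairs.

    A false negative is a solution the scorer rejected but the reviewer accepted.
    A false positive is a solution the scorer accepted but the reviewer rejected.
    """
    counts: Counter = Counter()
    for scorer_passed, reviewer_passed in pairs:
        if scorer_passed and reviewer_passed:
            counts["true_positive"] += 1
        elif scorer_passed and not reviewer_passed:
            counts["false_positive"] += 1
        elif not scorer_passed and reviewer_passed:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts
```

A scorer with some false negatives and no false positives, as the caption reports, errs toward rejecting correct work, which is the sense in which it is conservative.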

Original abstract

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed 'vibe coding'. It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a new suite for testing LLMs on simple, isolated Python greenfield tasks, but the narrow scope undercuts claims about vibe coding viability for real software engineering.

The main point is that this paper creates an evaluation suite for LLMs carrying out basic greenfield Python tasks from natural language prompts alone, framed as a way to test the practicality of vibe coding. It does not present results or detailed task lists in the abstract, so the work is mostly about the setup itself.

What is new is the focus on greenfield tasks instead of the usual bug-fixing or code-completion benchmarks. The author correctly identifies that vibe coding sits at the end of the abstraction ladder and tries to measure LLM performance in that setting, which is a reasonable direction even if the execution stays limited.

The paper is clear about its scoped intent, which is fair. The soft spot is that the tasks are simple and isolated, so it is not obvious why performance on them would illuminate whether vibe coding can handle actual greenfield work. Real projects involve multiple files, dependencies, integration, testing, and often languages other than Python. No argument is given for why the chosen task distribution would capture the relevant failure modes, and the abstract supplies no evidence that the suite was designed with those broader conditions in mind.

This is mainly useful for researchers who maintain LLM coding benchmarks and want another narrow test case to add to the collection. Readers looking for evidence that could guide decisions about replacing syntax with natural language on non-trivial projects will not find it here.

I would send it to peer review once the full version shows the actual tasks and any initial runs, because new evaluation suites can still be worth referee time even when incremental. The generalization question needs to be addressed directly in revision.

Referee Report

2 major / 0 minor

Summary. The paper defines 'vibe coding' as natural-language-only programming that eliminates code syntax, claims this represents the endpoint of high-level abstraction, and presents a new evaluation suite of LLM performance on simple, isolated greenfield Python programming tasks as providing scoped but meaningful insight into the viability of vibe coding for broader greenfield software engineering.

Significance. If the evaluation suite were shown to be representative and the results robust, the work could supply early empirical data on natural-language-driven coding capabilities, helping ground discussions of AI-assisted development. The explicit scoping to isolated Python tasks is noted, but the absence of any argument linking narrow-task performance to realistic multi-module or cross-language greenfield work limits the potential impact.

major comments (2)

[Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.
[Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] Abstract: the central claim that results on 'simple, isolated greenfield programming tasks in Python' supply 'scoped but meaningful insight' into vibe coding viability for 'broader greenfield software engineering tasks' is load-bearing, yet no justification, correlation evidence, or discussion of relevant failure modes (multi-module integration, dependency management, non-Python contexts, or end-to-end natural-language workflows) is supplied. This leaves the evaluation design at risk of measuring a different capability than advertised.

Authors: We agree that the abstract asserts scoped insight without supplying explicit justification or discussion of failure modes. The evaluation was designed as a controlled baseline for isolated natural-language-to-Python tasks, but the manuscript does not articulate why this baseline is informative for broader greenfield work or address integration and cross-language issues. We will revise the introduction to add a limitations subsection that explains the scoping rationale, notes the absence of correlation evidence, and discusses relevant failure modes such as multi-module integration and dependency management. revision: yes
Referee: [Abstract] Abstract and introduction: the manuscript provides no description of task design, metrics, model selection, baselines, or statistical analysis. Without these details the soundness of the evaluation suite cannot be assessed and the contribution cannot be evaluated against the stated goal.

Authors: The referee is correct that the abstract and introduction contain no methodological details on task design, metrics, models, baselines, or analysis. The current manuscript text is limited to high-level motivation. We will revise both sections to include concise summaries of the evaluation suite (task prompts, correctness metrics via unit tests, selected LLMs, any baselines, and statistical methods), enabling readers to evaluate soundness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with no derivations or self-referential reductions

The paper frames its contribution as the creation and application of an empirical evaluation suite for LLM performance on simple, isolated greenfield Python tasks. No equations, parameter fittings, derivations, uniqueness theorems, or ansatzes are described. The central claim is scoped explicitly to the tasks evaluated and does not reduce to any input by construction, self-citation chain, or renaming of prior results. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the domain assumption that simple isolated tasks can proxy greenfield software engineering; no free parameters, invented entities, or additional axioms are described in the abstract.

axioms (1)

Domain assumption: Simple, isolated greenfield programming tasks in Python provide scoped insight into vibe coding viability for software engineering.
Invoked to justify the evaluation suite design and its relevance to the broader claim.

