A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

Mohammad Khalil; Ronas Shakya; Sam Urmian

arxiv: 2503.13549 · v1 · pith:R7DSSKDPnew · submitted 2025-03-16 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-23 00:21 UTC · model grok-4.3

keywords: ChatGPT, DeepSeek-R1, Codeforces, programming tasks, LLM evaluation, success rates, difficulty levels, AI coding assistance

The pith

ChatGPT solves over half of medium Codeforces tasks while DeepSeek-R1 solves under one-fifth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces programming problems split into easy, medium, and hard levels. It measures how often each model produces an accepted solution, along with the memory use and runtime of those solutions. On easy tasks the two models perform similarly, yet on medium tasks ChatGPT reaches 54.5 percent success while DeepSeek-R1 reaches only 18.1 percent. Both models fail most hard tasks. Readers may care because the results point to which model is currently more useful for assisting with moderate programming work.

Core claim

The study shows that ChatGPT o3-mini and DeepSeek-R1 achieve comparable results on easy tasks, but ChatGPT attains a 54.5 percent success rate on medium-difficulty tasks against DeepSeek-R1's 18.1 percent, and both models fail to solve most hard tasks. Outcomes are judged by accepted solutions, memory efficiency, and runtime performance.
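The abstract gives the medium-task rates but not the number of medium tasks. As a rough check, the sketch below searches for group sizes at which both 54.5% and 18.1% can arise as a whole number of solved tasks. It assumes the percentages were truncated (not rounded) to one decimal place; the function name and that truncation rule are assumptions, not something the paper states.

```python
def consistent_task_counts(tenths_a, tenths_b, max_tasks):
    """Return group sizes n at which both percentages can arise as k/n.

    Percentages are passed in tenths of a percent (54.5% -> 545) and are
    compared after truncation to one decimal, using integer arithmetic only.
    The truncation rule is an assumption about how the paper reported values.
    """
    sizes = []
    for n in range(1, max_tasks + 1):
        # Every truncated percentage reachable with k solved tasks out of n.
        reachable = {(k * 1000) // n for k in range(n + 1)}
        if tenths_a in reachable and tenths_b in reachable:
            sizes.append(n)
    return sizes


if __name__ == "__main__":
    # 54.5% and 18.1% are the reported medium-task rates; 29 is the task total.
    print(consistent_task_counts(545, 181, 29))
```

Under these assumptions the reported rates fit 6 of 11 and 2 of 11 solved medium tasks (2/11 is about 18.18%, which truncates to 18.1). This is an inference from the percentages, not a split confirmed by the source.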

What carries the argument

Comparison of success rates, memory, and runtime on 29 Codeforces tasks grouped by easy, medium, and hard difficulty using single-run prompting.
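The comparison described above can be sketched as a small aggregation step. This is a minimal illustration, not the authors' code: the `Submission` record, its field names, and the demo rows are hypothetical, and averaging runtime and memory over accepted solutions only is an assumption about how such metrics would be compared.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Submission:
    """One judged submission (hypothetical record layout)."""
    model: str
    difficulty: str   # "easy", "medium", or "hard"
    accepted: bool
    runtime_ms: int
    memory_kb: int


def summarise(submissions):
    """Group by (model, difficulty) and report success rate and mean cost."""
    groups = defaultdict(list)
    for sub in submissions:
        groups[(sub.model, sub.difficulty)].append(sub)

    summary = {}
    for key, subs in groups.items():
        accepted = [s for s in subs if s.accepted]
        summary[key] = {
            "success_rate": len(accepted) / len(subs),
            # Cost metrics are only meaningful for accepted solutions.
            "mean_runtime_ms": mean(s.runtime_ms for s in accepted) if accepted else None,
            "mean_memory_kb": mean(s.memory_kb for s in accepted) if accepted else None,
        }
    return summary


if __name__ == "__main__":
    # Synthetic placeholder rows, NOT data from the paper.
    demo = [
        Submission("model-a", "medium", True, 120, 4000),
        Submission("model-a", "medium", False, 0, 0),
        Submission("model-b", "medium", False, 0, 0),
    ]
    for key, stats in summarise(demo).items():
        print(key, stats)
```

Keeping the grouping key explicit makes it easy to add a per-run index later if the protocol moves from one generation per task to several.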

If this is right

Programmers may choose ChatGPT over DeepSeek-R1 when seeking help on medium-complexity problems.
Both models need further advances before they can reliably handle hard competitive programming tasks.
The observed gap in medium-task success highlights differences in how the two models handle moderate algorithmic complexity.
The results supply a concrete benchmark for tracking future improvements in LLM coding performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The medium-task gap could arise from differences in training data or internal architecture that the single evaluation does not isolate.
Testing a larger set of problems or additional models would show whether the 54.5-to-18.1 split holds more broadly.
Developers could combine the two models, using ChatGPT for medium tasks and another approach for hard ones.

Load-bearing premise

The 29 chosen tasks, the prompting method, and the single-run protocol give an unbiased picture of each model's programming ability.

What would settle it

Repeating the 29 tasks with varied prompts or multiple runs and obtaining a DeepSeek-R1 medium-task success rate near or above 54.5 percent would undermine the reported performance gap.
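One common way to summarise such repeated runs is the pass@k estimator used in code-generation evaluation: generate n samples per task, count the c accepted ones, and estimate the chance that at least one of k draws is accepted. The sketch below is an illustration of that estimator, not a protocol from the paper; the function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which were accepted.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all rejected.
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(accepted_counts, n, k):
    """Average pass@k over tasks, given the accepted count for each task."""
    return sum(pass_at_k(n, c, k) for c in accepted_counts) / len(accepted_counts)


if __name__ == "__main__":
    # Hypothetical accepted counts for three tasks, 10 samples each.
    print(mean_pass_at_k([0, 3, 10], n=10, k=1))
```

With k = 1 the estimate reduces to the plain fraction of accepted samples per task, which is the quantity a single-run design observes only once.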

Figures

Figures reproduced from arXiv: 2503.13549 by Mohammad Khalil, Ronas Shakya, Sam Urmian.

**Figure 1.** Codeforces for competitive programming. Text from the paper's section B (Study setting) captured with this caption: ChatGPT-03-mini was used to generate the code for ChatGPT, while DeepSeek-R1 model was used for DeepSeek. On Codeforces, multiple compiler options are available for C++ submissions, including versions of GNU G++, Clang, and MSVC. To ensure a fair comparison and uniform execution across all solutions, we selected GNU G++20 13.2 (64-bit, winlib) as the …

**Figure 2.** Example prompt for generating coding soluti…

**Figure 3.** The bar graph shows the average weighted sco…

**Figure 4.** Time comparison for 29 programming tasks usi…

**Figure 5.** Memory usage comparison for 29 programming t…

Original abstract

The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models: ChatGPT 03-mini and DeepSeek-R1 on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks of three levels of easy, medium, and hard difficulty, we assessed the outcome of both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek 18.1%. Both models struggled with hard tasks, thus highlighting some ongoing challenges LLMs face in handling highly complex programming problems. These findings highlight key differences in both model capabilities and their computational power, offering valuable insights for developers and researchers working to advance AI-driven programming tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The 54.5% vs 18.1% medium-task gap rests on single runs of stochastic models with no variance or retry details, so it is too noisy to trust.

The main point is that this paper gives a head-to-head on ChatGPT o3-mini and DeepSeek-R1 using 29 Codeforces tasks, but the headline numbers come from what appears to be one generation per task. That makes the reported gap on medium problems unreliable without any measure of variability or multiple samples per task.

The abstract states both models handle easy tasks similarly, both fail on hard ones, and ChatGPT does better on medium, plus it tracks memory and runtime on accepted solutions. Those are straightforward empirical observations on real competitive-programming problems rather than synthetic benchmarks. The work is new in the narrow sense that these exact success rates for these two models on this task set have not been published before. It is useful as a quick snapshot for someone who needs to pick between these tools for coding assistance today.

The methods section is the clear weak point. No information appears on the prompting strategy, temperature setting, number of trials allowed, or any statistical test. With only 29 tasks total and an unknown split by difficulty, modest per-task variance can easily flip the observed difference. The stress-test concern holds up on the abstract alone.

This paper is for practitioners who want a current data point on these two models. A reader seeking solid evidence on LLM limits for programming would need clearer protocols and repeated runs before treating the numbers as stable. I would not bring it to a reading group or cite it. It does not look ready for peer review in its current form.
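The small-sample concern in this letter can be made concrete with a confidence interval on each success rate. The sketch below computes a Wilson score interval. It assumes 6 of 11 and 2 of 11 medium tasks were solved, counts that merely fit the reported 54.5% and 18.1% and are not stated in the abstract; the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Assumed counts inferred from the percentages, not given by the source.
    for label, solved in (("ChatGPT", 6), ("DeepSeek-R1", 2)):
        low, high = wilson_interval(solved, 11)
        print(f"{label}: {low:.3f} to {high:.3f}")
```

Under the assumed counts the two 95% intervals overlap, which is consistent with the letter's point that a single run on so few tasks cannot pin down the size of the gap.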

Referee Report

2 major / 1 minor

Summary. The manuscript claims to evaluate ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces competitive programming tasks across easy, medium, and hard difficulty levels. It reports similar performance on easy tasks, a substantial advantage for ChatGPT on medium tasks (54.5% success rate versus 18.1% for DeepSeek), and poor performance by both models on hard tasks, with outcomes assessed via accepted solutions, memory efficiency, and runtime.

Significance. If the experimental protocol were fully documented with reproducible prompting details, multiple trials, and statistical analysis, the work would supply a direct empirical head-to-head comparison of two LLMs on competitive programming benchmarks. The reported medium-task gap could inform practical model selection in AI-assisted coding, but the current presentation leaves the central quantitative claims unsubstantiated.

major comments (2)

[Abstract] The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.
[Abstract] The evaluation rests on a single generation per task across only 29 problems with an unspecified distribution across difficulty levels. No variance estimates or repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.
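The kind of statistical test these comments ask for could be as simple as Fisher's exact test on the medium-task outcomes. The sketch below is one possible check, not an analysis from the paper. It assumes 11 medium tasks with 6 solved by ChatGPT and 2 by DeepSeek-R1, counts inferred from the percentages and not stated in the abstract; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return Fraction(extreme, comb(row1 + row2, col1))


if __name__ == "__main__":
    # Rows: ChatGPT, DeepSeek-R1. Columns: solved, unsolved (assumed counts).
    p_value = fisher_exact_two_sided(6, 5, 2, 9)
    print(float(p_value))
```

With these assumed counts the p-value is roughly 0.18, above the conventional 0.05 threshold, so the gap would not count as statistically significant on this test alone. Exact integer arithmetic via `Fraction` avoids floating-point ties when comparing table probabilities.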

minor comments (1)

[Abstract] The model name appears as 'ChatGPT 03-mini'; this should be corrected to the standard 'o3-mini' designation for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our experimental protocol. We address each major comment below and will revise the manuscript accordingly to improve reproducibility and acknowledge limitations.

Referee: [Abstract] The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.

Authors: We agree that the abstract (and current manuscript) omits these details. In revision we will add a Methods section describing the exact prompting strategy (zero-shot prompts with task-specific instructions), confirm a single generation per task, note default temperature settings, and state that no statistical tests were applied. This will allow readers to assess the reliability of the reported gap.
Revision: yes

Referee: [Abstract] The evaluation rests on a single generation per task across only 29 problems with an unspecified distribution across difficulty levels. No variance estimates or repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.

Authors: We acknowledge the single-generation design and small sample. The revision will explicitly report the task distribution (e.g., counts per difficulty), state that only one generation was used per problem, and discuss the implications of stochasticity as a limitation. We will also add a note recommending repeated sampling in future studies. The current results reflect the single-run outcomes obtained.
Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark with no derivations

The paper is a straightforward empirical comparison of two LLMs on 29 Codeforces tasks, reporting observed success rates (e.g., 54.5% vs 18.1% on medium tasks) without any equations, fitted parameters, predictions, ansatzes, or self-citations that reduce claims to inputs by construction. All load-bearing steps are external measurements on independent model outputs, satisfying the criteria for a self-contained non-circular result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard benchmarking assumptions rather than new mathematical structure; the main unstated premises concern fair prompting and representative task selection.

free parameters (1)

task selection
Choice of which 29 problems to include can influence the reported percentages and is not justified by external criteria in the abstract.

axioms (1)

domain assumption: Models received equivalent prompts and evaluation conditions
Required for any valid head-to-head comparison but not described in the abstract.

