arxiv: 2503.13549 · v1 · pith:R7DSSKDPnew · submitted 2025-03-16 · 💻 cs.SE · cs.AI

A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

Pith reviewed 2026-05-23 00:21 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords ChatGPT · DeepSeek-R1 · Codeforces · programming tasks · LLM evaluation · success rates · difficulty levels · AI coding assistance

The pith

ChatGPT solves over half of medium Codeforces tasks while DeepSeek-R1 solves under one-fifth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces programming problems split into easy, medium, and hard levels. It measures how often each model produces an accepted solution, along with the memory use and runtime of its submissions. On easy tasks the two models perform similarly, but on medium tasks ChatGPT reaches a 54.5 percent success rate while DeepSeek-R1 reaches only 18.1 percent. Both models fail most hard tasks. The results matter to readers because they indicate which model is currently more useful for moderately difficult programming work.

Core claim

The study shows that ChatGPT o3-mini and DeepSeek-R1 achieve comparable results on easy tasks, but on medium-difficulty tasks ChatGPT attains a 54.5 percent success rate against DeepSeek-R1's 18.1 percent. Both models fail to solve most hard tasks. Outcomes are judged by accepted solutions, memory efficiency, and runtime performance.

What carries the argument

A comparison of success rates, memory use, and runtime on 29 Codeforces tasks grouped into easy, medium, and hard difficulty, with each model prompted once per task.

If this is right

  • Programmers may choose ChatGPT over DeepSeek-R1 when seeking help on medium-complexity problems.
  • Both models need further advances before they can reliably handle hard competitive programming tasks.
  • The observed gap in medium-task success highlights differences in how the two models handle moderate algorithmic complexity.
  • The results supply a concrete benchmark for tracking future improvements in LLM coding performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The medium-task gap could arise from differences in training data or internal architecture that the single evaluation does not isolate.
  • Testing a larger set of problems or additional models would show whether the 54.5-to-18.1 split holds more broadly.
  • Developers could combine the two models, using ChatGPT for medium tasks and another approach for hard ones.
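
How much a larger problem set could change the picture can be sketched with a confidence interval. The example below is a minimal illustration, not the paper's analysis. It assumes 11 medium tasks, the smallest count consistent with both reported percentages (6/11 is about 54.5 percent and 2/11 is about 18.2 percent, which the paper appears to report as 18.1); the abstract does not state the actual count. The function name `wilson_interval` and the constant `ASSUMED_MEDIUM_TASKS` are introduced here for illustration only.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: 11 medium tasks, inferred from the reported percentages only.
ASSUMED_MEDIUM_TASKS = 11
chatgpt_ci = wilson_interval(6, ASSUMED_MEDIUM_TASKS)   # 6/11 ~ 54.5%
deepseek_ci = wilson_interval(2, ASSUMED_MEDIUM_TASKS)  # 2/11 ~ 18.2%

if __name__ == "__main__":
    print(f"ChatGPT  medium success: {6 / 11:.3f}, 95% CI {chatgpt_ci}")
    print(f"DeepSeek medium success: {2 / 11:.3f}, 95% CI {deepseek_ci}")
    print("Intervals overlap:", deepseek_ci[1] > chatgpt_ci[0])
```

Under these assumed counts the two intervals are wide (roughly 28 to 79 percent and 5 to 48 percent) and they overlap, which is why a larger task set would be informative.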

Load-bearing premise

The 29 chosen tasks, the prompting method, and the single-run protocol give an unbiased picture of each model's programming ability.

What would settle it

Repeating the 29 tasks with varied prompts or multiple runs and obtaining a DeepSeek-R1 medium-task success rate near or above 54.5 percent would undermine the reported performance gap.
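
A repeated-run check of this kind could be organized roughly as follows. This is a sketch under stated assumptions, not the authors' protocol: `attempt` is a hypothetical callback standing in for "generate a solution and submit it to Codeforces", and `simulated_attempt`, its 0.4 acceptance probability, and the default of 5 runs are placeholders chosen only so the example executes.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable


def repeated_run_rates(
    tasks: Iterable[str],
    attempt: Callable[[str, int], bool],
    runs: int = 5,
) -> Dict[str, float]:
    """Return, for each task, the fraction of runs that were accepted."""
    return {
        task: mean(1.0 if attempt(task, run) else 0.0 for run in range(runs))
        for task in tasks
    }


def summarize(rates: Dict[str, float]) -> Dict[str, float]:
    """Aggregate per-task acceptance rates into two headline numbers."""
    per_task = list(rates.values())
    return {
        "mean_accept_rate": mean(per_task),
        "solved_at_least_once": mean(1.0 if r > 0 else 0.0 for r in per_task),
    }


def simulated_attempt(task: str, run: int) -> bool:
    """HYPOTHETICAL stand-in for a real model call plus judge verdict."""
    rng = random.Random(f"{task}-{run}")  # deterministic per (task, run)
    return rng.random() < 0.4             # placeholder acceptance probability


if __name__ == "__main__":
    demo_tasks = [f"medium-{i}" for i in range(1, 12)]  # placeholder task ids
    print(summarize(repeated_run_rates(demo_tasks, simulated_attempt)))
```

Reporting both the mean acceptance rate and the share of tasks solved at least once separates "sometimes solves it" from "reliably solves it", a distinction a single run per task cannot show.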

Figures

Figures reproduced from arXiv: 2503.13549 by Mohammad Khalil, Ronas Shakya, Sam Urmian.

Figure 1. Codeforces for competitive programming.
    Text from the paper's "B. Study setting" section, carried along with this caption: ChatGPT-03-mini was used to generate the code for ChatGPT, while DeepSeek-R1 model was used for DeepSeek. On Codeforces, multiple compiler options are available for C++ submissions, including versions of GNU G++, Clang, and MSVC. To ensure a fair comparison and uniform execution across all solutions, we selected GNU G++20 13.2 (64-bit, winlib) as the …
Figure 2. Example prompt for generating coding soluti…
Figure 3. The bar graph shows the average weighted sco…
Figure 4. Time comparison for 29 programming tasks usi…
Figure 5. Memory usage comparison for 29 programming t…

Original abstract

The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models: ChatGPT 03-mini and DeepSeek-R1 on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks of three levels of easy, medium, and hard difficulty, we assessed the outcome of both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek 18.1%. Both models struggled with hard tasks, thus highlighting some ongoing challenges LLMs face in handling highly complex programming problems. These findings highlight key differences in both model capabilities and their computational power, offering valuable insights for developers and researchers working to advance AI-driven programming tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to evaluate ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces competitive programming tasks across easy, medium, and hard difficulty levels. It reports similar performance on easy tasks, a substantial advantage for ChatGPT on medium tasks (54.5% success rate versus 18.1% for DeepSeek), and poor performance by both models on hard tasks, with outcomes assessed via accepted solutions, memory efficiency, and runtime.

Significance. If the experimental protocol were fully documented with reproducible prompting details, multiple trials, and statistical analysis, the work would supply a direct empirical head-to-head comparison of two LLMs on competitive programming benchmarks. The reported medium-task gap could inform practical model selection in AI-assisted coding, but the current presentation leaves the central quantitative claims unsubstantiated.

major comments (2)
  1. [Abstract] Abstract: The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.
  2. [Abstract] Abstract: The evaluation rests on a single generation per task across only 29 problems, with an unspecified distribution across difficulty levels. Neither variance estimates nor a repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.
minor comments (1)
  1. [Abstract] Abstract: The model name appears as 'ChatGPT 03-mini'; this should be corrected to the standard 'o3-mini' designation for clarity.
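
The kind of statistical test the referee asks for can be illustrated with Fisher's exact test on the medium-task results. This is a hedged sketch rather than anything the paper reports: the 2x2 counts (6 of 11 solved versus 2 of 11 solved) are an assumption inferred from the published percentages, and `fisher_exact_two_sided` is a name introduced here, implemented with the standard library only.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows are models, columns are solved / unsolved. The p-value sums the
    hypergeometric probabilities of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(k: int) -> int:
        # Unnormalized hypergeometric weight; integers keep ties exact.
        return comb(col1, k) * comb(total - col1, row1 - k)

    k_min = max(0, row1 - (total - col1))
    k_max = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(total, row1)


# ASSUMED counts: 11 medium tasks per model, 6 solved vs 2 solved.
p_value = fisher_exact_two_sided(6, 5, 2, 9)

if __name__ == "__main__":
    print(f"two-sided p-value under the assumed counts: {p_value:.4f}")
```

With these assumed counts the p-value is about 0.18, so a gap of this size on 11 tasks would not clear a conventional 0.05 threshold. Different true counts would give a different answer.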

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our experimental protocol. We address each major comment below and will revise the manuscript accordingly to improve reproducibility and acknowledge limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.

    Authors: We agree that the abstract (and current manuscript) omits these details. In revision we will add a Methods section describing the exact prompting strategy (zero-shot prompts with task-specific instructions), confirm a single generation per task, note default temperature settings, and state that no statistical tests were applied. This will allow readers to assess the reliability of the reported gap. revision: yes

  2. Referee: [Abstract] Abstract: The evaluation rests on a single generation per task across only 29 problems, with an unspecified distribution across difficulty levels. Neither variance estimates nor a repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.

    Authors: We acknowledge the single-generation design and small sample. The revision will explicitly report the task distribution (e.g., counts per difficulty), state that only one generation was used per problem, and discuss the implications of stochasticity as a limitation. We will also add a note recommending repeated sampling in future studies. The current results reflect the single-run outcomes obtained. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark with no derivations

The paper is a straightforward empirical comparison of two LLMs on 29 Codeforces tasks, reporting observed success rates (e.g., 54.5% vs 18.1% on medium tasks) without any equations, fitted parameters, predictions, ansatzes, or self-citations that reduce claims to inputs by construction. All load-bearing steps are external measurements on independent model outputs, satisfying the criteria for a self-contained non-circular result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard benchmarking assumptions rather than new mathematical structure; the main unstated premises concern fair prompting and representative task selection.

free parameters (1)
  • task selection
    Choice of which 29 problems to include can influence the reported percentages and is not justified by external criteria in the abstract.
axioms (1)
  • domain assumption: Models received equivalent prompts and evaluation conditions
    Required for any valid head-to-head comparison but not described in the abstract.

pith-pipeline@v0.9.0 · 5685 in / 1167 out tokens · 70513 ms · 2026-05-23T00:21:18.722251+00:00 · methodology
