A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks
Pith reviewed 2026-05-23 00:21 UTC · model grok-4.3
The pith
ChatGPT solves over half of medium Codeforces tasks while DeepSeek-R1 solves under one-fifth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study evaluates ChatGPT o3-mini and DeepSeek-R1 on accepted solutions, memory efficiency, and runtime performance. The two models achieve comparable results on easy tasks, but ChatGPT attains a 54.5 percent success rate on medium-difficulty tasks against DeepSeek-R1's 18.1 percent, and both fail to solve most hard tasks.
What carries the argument
Comparison of success rates, memory, and runtime on 29 Codeforces tasks grouped by easy, medium, and hard difficulty using single-run prompting.
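The comparison described above reduces to tabulating accepted verdicts per difficulty tier. A minimal sketch, assuming each result is recorded as a (difficulty, accepted) pair; the function name `success_rates` and the sample records are hypothetical and are not the paper's data:

```python
from collections import defaultdict


def success_rates(records):
    """Return {difficulty: fraction accepted} from (difficulty, accepted) pairs."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for difficulty, ok in records:
        totals[difficulty] += 1
        accepted[difficulty] += int(ok)
    return {d: accepted[d] / totals[d] for d in totals}


# Hypothetical verdicts for illustration only, not the paper's results.
records = [
    ("easy", True),
    ("easy", True),
    ("medium", True),
    ("medium", False),
    ("hard", False),
]
rates = success_rates(records)
```

Keeping the raw per-task verdicts, rather than only the percentages, is what lets a reader recompute the rates or rerun the comparison.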
If this is right
- Programmers may choose ChatGPT over DeepSeek-R1 when seeking help on medium-complexity problems.
- Both models need further advances before they can reliably handle hard competitive programming tasks.
- The observed gap in medium-task success highlights differences in how the two models handle moderate algorithmic complexity.
- The results supply a concrete benchmark for tracking future improvements in LLM coding performance.
Where Pith is reading between the lines
- The medium-task gap could arise from differences in training data or internal architecture that the single evaluation does not isolate.
- Testing a larger set of problems or additional models would show whether the 54.5-to-18.1 split holds more broadly.
- Developers could combine the two models, using ChatGPT for medium tasks and another approach for hard ones.
Load-bearing premise
The 29 chosen tasks, the prompting method, and the single-run protocol give an unbiased picture of each model's programming ability.
What would settle it
Repeating the 29 tasks with varied prompts or multiple runs and obtaining a DeepSeek-R1 medium-task success rate near or above 54.5 percent would undermine the reported performance gap.
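One way to gauge how much uncertainty a single run over a small task set leaves is a Wilson score interval on each success rate. The sketch below assumes 11 medium tasks with 6 and 2 accepted solutions. Those counts are an assumption, chosen only because they round close to the reported 54.5 and 18.1 percent; the source does not state how many medium tasks there were.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Assumed counts (not reported in the source): 11 medium tasks per model.
chatgpt_low, chatgpt_high = wilson_interval(6, 11)
deepseek_low, deepseek_high = wilson_interval(2, 11)

# True when the lower bound of the higher rate falls below the upper bound of the lower rate.
intervals_overlap = chatgpt_low < deepseek_high
```

Under these assumed counts the two intervals overlap, which is consistent with the point that repeated runs or a larger task set would be needed to settle the gap.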
Original abstract
The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models: ChatGPT 03-mini and DeepSeek-R1 on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks of three levels of easy, medium, and hard difficulty, we assessed the outcome of both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek 18.1%. Both models struggled with hard tasks, thus highlighting some ongoing challenges LLMs face in handling highly complex programming problems. These findings highlight key differences in both model capabilities and their computational power, offering valuable insights for developers and researchers working to advance AI-driven programming tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to evaluate ChatGPT o3-mini and DeepSeek-R1 on 29 Codeforces competitive programming tasks across easy, medium, and hard difficulty levels. It reports similar performance on easy tasks, a substantial advantage for ChatGPT on medium tasks (54.5% success rate versus 18.1% for DeepSeek), and poor performance by both models on hard tasks, with outcomes assessed via accepted solutions, memory efficiency, and runtime.
Significance. If the experimental protocol were fully documented with reproducible prompting details, multiple trials, and statistical analysis, the work would supply a direct empirical head-to-head comparison of two LLMs on competitive programming benchmarks. The reported medium-task gap could inform practical model selection in AI-assisted coding, but the current presentation leaves the central quantitative claims unsubstantiated.
major comments (2)
- [Abstract] Abstract: The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.
- [Abstract] Abstract: The evaluation rests on a single generation per task across only 29 problems with an unspecified distribution across difficulty levels. No variance estimates or repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.
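The statistical test the referee asks for could be as simple as a Fisher exact test on the medium-task verdicts. The sketch below uses only the standard library and assumes a 2x2 table of 6 accepted and 5 rejected for ChatGPT against 2 accepted and 9 rejected for DeepSeek-R1. These counts are hypothetical, picked to match the reported percentages under an assumed 11 medium tasks, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Unnormalised hypergeometric weights, kept as integers for exact comparison.
    weight = {x: comb(row1, x) * comb(row2, col1 - x) for x in range(lo, hi + 1)}
    observed = weight[a]
    # Sum every table that is no more likely than the observed one.
    extreme = sum(w for w in weight.values() if w <= observed)
    return extreme / comb(row1 + row2, col1)


# Assumed counts (not reported in the source).
p_value = fisher_exact_two_sided(6, 5, 2, 9)
```

With these assumed counts the p-value comes out at roughly 0.18, so a gap of this size on about 11 tasks would not by itself rule out chance.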
minor comments (1)
- [Abstract] Abstract: The model name appears as 'ChatGPT 03-mini'; this should be corrected to the standard 'o3-mini' designation for clarity.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for greater transparency in our experimental protocol. We address each major comment below and will revise the manuscript accordingly to improve reproducibility and acknowledge limitations.
- Referee: [Abstract] Abstract: The headline performance figures (54.5% versus 18.1% success on medium-difficulty tasks) are stated without any accompanying description of prompting strategy, number of generations per task, temperature settings, or statistical tests. Because LLMs are stochastic, these omissions make the reported gap impossible to verify or interpret as a reliable capability difference.
  Authors: We agree that the abstract (and current manuscript) omits these details. In revision we will add a Methods section describing the exact prompting strategy (zero-shot prompts with task-specific instructions), confirm a single generation per task, note default temperature settings, and state that no statistical tests were applied. This will allow readers to assess the reliability of the reported gap.
  Revision: yes
- Referee: [Abstract] Abstract: The evaluation rests on a single generation per task across only 29 problems with an unspecified distribution across difficulty levels. No variance estimates or repeated-sampling protocol is mentioned, so even modest per-task stochasticity could reverse the observed medium-task gap; this single-run design is load-bearing for the central comparative claim.
  Authors: We acknowledge the single-generation design and small sample. The revision will explicitly report the task distribution (e.g., counts per difficulty), state that only one generation was used per problem, and discuss the implications of stochasticity as a limitation. We will also add a note recommending repeated sampling in future studies. The current results reflect the single-run outcomes obtained.
  Revision: yes
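If repeated sampling is adopted, one common summary is the unbiased pass@k estimator used in code-generation benchmarks. A minimal sketch, assuming n generations per task of which c are accepted; the per-task counts are hypothetical and nothing here is taken from the paper's protocol:

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is accepted,
    given c accepted out of n generated samples: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when n - c < k, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three tasks, not the paper's data.
per_task = [(10, 3), (10, 0), (10, 10)]
mean_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_task) / len(per_task)
```

Averaging the estimate over tasks gives a per-difficulty score that is less sensitive to one lucky or unlucky generation than a single-run verdict.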
Circularity Check
No circularity: direct empirical benchmark with no derivations
The paper is a straightforward empirical comparison of two LLMs on 29 Codeforces tasks, reporting observed success rates (e.g., 54.5% vs 18.1% on medium tasks) without any equations, fitted parameters, predictions, ansatzes, or self-citations that reduce claims to inputs by construction. All load-bearing steps are external measurements on independent model outputs, satisfying the criteria for a self-contained non-circular result.
Axiom & Free-Parameter Ledger
free parameters (1)
- task selection
axioms (1)
- domain assumption: Models received equivalent prompts and evaluation conditions