Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating

Ayush Kumar Jha; Shalini Jha

arxiv: 2606.05228 · v1 · pith:NM6WLGETnew · submitted 2026-06-02 · 💻 cs.SE · cs.PL

Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating

Ayush Kumar Jha , Shalini Jha This is my paper

Pith reviewed 2026-06-28 08:35 UTC · model grok-4.3

classification 💻 cs.SE cs.PL

keywords large language modelscompetitive programmingchain of thoughtfailure analysiscode generationalgorithm categoriesGPT-4oClaude

0 comments

The pith

Chain-of-thought prompting lowers GPT-4o's competitive programming pass rate from 46% to 36.8% while tripling Claude's compile errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates GPT-4o and Claude on 315 Codeforces problems organized into seven algorithm categories and three difficulty tiers. It compares direct zero-shot generation against zero-shot chain-of-thought prompting at fixed temperature. Forcing chain-of-thought reduces GPT-4o's overall success and worsens its performance on greedy problems, while the same change increases Claude's compile errors by 244 percent through poorer adherence to output format. Wrong answers constitute the large majority of failures for both models. The results indicate that common prompt techniques do not close the gap between general language-model strengths and the demands of algorithmic coding tasks.

Core claim

The ablation study shows a severe divergence from standard NLP benchmarks: forcing CoT aggressively penalizes GPT-4o, dropping its pass rate from 46.0% to 36.8% and exacerbating a critical weakness in Greedy logic. Conversely, while Claude maintains a higher logical baseline (63.5% under CoT), the expanded text generation severely degrades its markdown instruction adherence, causing its Compile Errors to more than triple (from 9 to 31, a 244% increase). Wrong Answer is the dominant verdict for both models, accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions.

What carries the argument

A balanced taxonomy of 315 Codeforces problems across seven algorithm categories and three difficulty tiers, used to compare direct zero-shot generation against zero-shot Chain-of-Thought under controlled execution-based scoring.

If this is right

Wrong Answer remains the dominant failure mode, exceeding 90 percent for GPT-4o and 70 percent for Claude.
Chain-of-thought exacerbates GPT-4o's weakness on greedy algorithms.
Chain-of-thought triples Claude's compile errors by degrading markdown format adherence.
Standard prompt-engineering techniques do not bridge the algorithmic reasoning gap in competitive programming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Model-specific prompting strategies may be needed rather than a single CoT template for algorithmic tasks.
The category-level failure data could be used to prioritize training data for greedy or other weak algorithm types.
The same taxonomy could serve as a diagnostic benchmark when new models or prompting methods are introduced.
Real-world coding assistants that default to chain-of-thought may inherit the same format and logic weaknesses observed here.

Load-bearing premise

The 315 Codeforces problems form a balanced and representative taxonomy across seven algorithm categories and three difficulty tiers that allows reliable generalization of observed failure patterns.

What would settle it

Re-running the identical evaluation protocol on a fresh collection of at least 300 Codeforces problems drawn from the same seven algorithm categories and three difficulty tiers and obtaining materially different pass-rate changes or error distributions under CoT.

Figures

Figures reproduced from arXiv: 2606.05228 by Ayush Kumar Jha, Shalini Jha.

**Figure 2.** Figure 2: Pass rates (%) across the Algorithm × Difficulty taxonomy under the Chain-of-Thought (CoT) condition. GPT-4o exhibits a severe capability collapse at the Hard tier (dropping to 0% on DP and Binary Search), while Claude Sonnet 4.6 maintains structural stability. 4.5 Failure Mode Analysis (RQ3) Evaluating the raw distribution of verdict labels provides crucial insight into the nature of LLM coding failures. … view at source ↗

**Figure 3.** Figure 3: Distribution of failure verdicts under CoT prompting. The requirement to generate verbose mathematical [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of GPT-4o’s failure types across algorithm categories. The uniform dominance of Wrong Answer [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Large language models (LLMs) demonstrate increasing proficiency on competitive programming benchmarks, yet technical reports predominantly publish aggregate pass rates, obscuring domain-specific vulnerabilities. We present a systematic empirical study of LLM failure patterns using a balanced taxonomy of 315 Codeforces problems across seven algorithm categories and three difficulty tiers. We evaluate GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions, controlling for temperature (T = 0.2). To isolate the impact of reasoning frameworks on algorithmic correctness, we conduct an ablation study comparing direct zero-shot generation against zero-shot Chain-of-Thought (CoT). Our findings reveal a severe divergence from standard NLP benchmarks: forcing CoT aggressively penalizes GPT-4o, dropping its pass rate from 46.0% to 36.8% and exacerbating a critical weakness in Greedy logic. Conversely, while Claude maintains a higher logical baseline (63.5% under CoT), the expanded text generation severely degrades its markdown instruction adherence, causing its Compile Errors to more than triple (from 9 to 31, a 244% increase). Furthermore, failure-mode analysis indicates that Wrong Answer (WA) is the dominant verdict for both models--accounting for over 90% of GPT-4o's and roughly 70% of Claude's unaccepted solutions. These findings empirically demonstrate that standard prompt engineering techniques fail to bridge the algorithmic reasoning gap in competitive programming environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoT prompting drops GPT-4o pass rate on these Codeforces problems while tripling Claude's compile errors, with wrong answers dominating both.

read the letter

The paper's main contribution is an empirical breakdown of where two LLMs fail on 315 Codeforces problems, sorted by seven algorithm categories and three difficulty levels. It reports that zero-shot CoT lowers GPT-4o's overall pass rate from 46% to 36.8% and hurts its Greedy performance in particular, while Claude holds a higher baseline but sees compile errors jump from 9 to 31 because the longer outputs break markdown formatting. Wrong answers account for over 90% of GPT-4o's failures and about 70% of Claude's.

What works is the move away from single aggregate scores toward category-specific failure tracking under fixed temperature and strict execution. The direct ablation on prompting style produces concrete, model-differentiated numbers that are easy to check against other benchmarks.

The soft spot is the problem set itself. The abstract calls the taxonomy balanced across categories and tiers, but gives no sampling frame, inclusion rules, or per-category counts. If the 315 problems were not chosen through stratified selection, the reported Greedy weakness or the WA dominance could simply reflect how many problems of each type ended up in the collection. That makes generalization to other competitive programming sets uncertain until the distribution and selection details are shown.

This is useful for researchers who already run coding evaluations and want finer-grained diagnostics on prompting and model differences. It is not foundational, but the measurements are straightforward enough that a referee could verify the execution setup and ask for the missing selection statistics. I would send it to review rather than desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a systematic empirical study of LLM failures on competitive programming tasks. It evaluates GPT-4o and Claude Sonnet 4.6 on a set of 315 Codeforces problems organized into a taxonomy of seven algorithm categories and three difficulty tiers. The study compares zero-shot direct generation against zero-shot Chain-of-Thought prompting under fixed temperature (T=0.2), reports pass rates (e.g., GPT-4o drops from 46.0% to 36.8% with CoT), and analyzes dominant failure modes such as Wrong Answer (>90% for GPT-4o) and increased Compile Errors for Claude under CoT.

Significance. If the problem taxonomy is representative, the results would usefully document model-specific divergences from NLP benchmarks, including the counter-intuitive penalty from CoT on GPT-4o for Greedy problems and the markdown adherence degradation in Claude. The work supplies concrete numerical contrasts and failure-mode breakdowns that could guide targeted improvements in algorithmic reasoning for LLMs.

major comments (1)

[Abstract / §3] Abstract and §3 (Problem Selection / Taxonomy Construction): the central claims about generalizable failure patterns (CoT penalizing GPT-4o by 9.2 pp, tripling Claude compile errors, WA dominating 70-90% of failures) rest on the assertion of a 'balanced taxonomy' across seven categories and three tiers. No sampling frame, inclusion/exclusion criteria, stratification procedure, or post-selection balance statistics (e.g., problem counts per cell) are supplied, so it is impossible to determine whether observed patterns reflect intrinsic algorithmic weaknesses or artifacts of the chosen 315 problems.

minor comments (1)

[Abstract] The abstract reports exact percentages and error counts but does not indicate whether statistical significance tests or confidence intervals accompany the pass-rate differences.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting this important methodological point. The concern about transparency in problem selection is valid, and we will strengthen the manuscript accordingly while preserving the core empirical findings on failure modes.

read point-by-point responses

Referee: [Abstract / §3] Abstract and §3 (Problem Selection / Taxonomy Construction): the central claims about generalizable failure patterns (CoT penalizing GPT-4o by 9.2 pp, tripling Claude compile errors, WA dominating 70-90% of failures) rest on the assertion of a 'balanced taxonomy' across seven categories and three tiers. No sampling frame, inclusion/exclusion criteria, stratification procedure, or post-selection balance statistics (e.g., problem counts per cell) are supplied, so it is impossible to determine whether observed patterns reflect intrinsic algorithmic weaknesses or artifacts of the chosen 315 problems.

Authors: We agree that explicit documentation of the selection process is necessary for readers to evaluate potential selection artifacts. The 315 problems were manually curated from Codeforces to achieve coverage across the seven algorithm categories (e.g., Greedy, Dynamic Programming) and three difficulty tiers while ensuring each problem had publicly available test cases and a unique primary algorithm tag. In the revised manuscript we will add to §3: (1) the precise inclusion criteria (problems from contests 800–2000 rating, English statements, non-interactive, with at least 10 test cases), (2) the stratification procedure (target minimum of 10–15 problems per category-tier cell, with oversampling in underrepresented cells such as high-difficulty Graph problems), and (3) a new table reporting the final cell counts. We will also clarify that the taxonomy is a curated, balanced sample rather than a random or exhaustive draw from all Codeforces problems; the observed patterns (e.g., CoT penalty on GPT-4o) are therefore best interpreted as applying to this representative coverage of algorithmic types rather than as population estimates. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper reports pass rates, failure-mode counts, and ablation results from direct evaluation of GPT-4o and Claude on a fixed set of 315 Codeforces problems under controlled prompting conditions. No equations, fitted parameters, predictions derived from prior inputs, uniqueness theorems, or ansatzes appear anywhere in the text. All reported quantities (e.g., 46.0% to 36.8% pass-rate drop, 9-to-31 compile-error increase, WA dominance percentages) are direct empirical measurements, not reductions of the input data by construction. The representativeness claim is an unverified assumption about external validity rather than a circular step inside the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on the representativeness of the chosen problem set and standard assumptions of execution-based LLM evaluation; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The 315 Codeforces problems constitute a balanced taxonomy across seven algorithm categories and three difficulty tiers.
This premise enables the systematic failure-mode analysis by category and tier.

pith-pipeline@v0.9.1-grok · 5796 in / 1309 out tokens · 35023 ms · 2026-06-28T08:35:43.187453+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 5 canonical work pages · 5 internal anchors

[1]

The Claude 3 model family: Opus, sonnet, haiku

Anthropic. The Claude 3 model family: Opus, sonnet, haiku. Technical report, Anthropic, 2024

2024
[2]

Claude Sonnet 4.6

Anthropic. Claude Sonnet 4.6. Anthropic model announcement, 2026. Released February 2026. https: //www.anthropic.com/news

2026
[3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

CodeContests: A competitive programming dataset

DeepMind. CodeContests: A competitive programming dataset. HuggingFace Datasets, 2022. https:// huggingface.co/datasets/deepmind/code_contests

2022
[5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence.arXiv preprint arXiv:2401.14196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks), 2021. arXiv:2105.09938

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

2025
[8]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

2022
[9]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, et al. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

2023
[10]

Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024
[11]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

2023
[14]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837, 2022. 11 LLM Failure Taxonomy on Competitive ProgrammingPREPRINT Appendices A Prompt Templa...

2022

[1] [1]

The Claude 3 model family: Opus, sonnet, haiku

Anthropic. The Claude 3 model family: Opus, sonnet, haiku. Technical report, Anthropic, 2024

2024

[2] [2]

Claude Sonnet 4.6

Anthropic. Claude Sonnet 4.6. Anthropic model announcement, 2026. Released February 2026. https: //www.anthropic.com/news

2026

[3] [3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

CodeContests: A competitive programming dataset

DeepMind. CodeContests: A competitive programming dataset. HuggingFace Datasets, 2022. https:// huggingface.co/datasets/deepmind/code_contests

2022

[5] [5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence.arXiv preprint arXiv:2401.14196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with APPS. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks), 2021. arXiv:2105.09938

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

LiveCodeBench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

2025

[8] [8]

Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode.Science, 378(6624):1092–1097, 2022

2022

[9] [9]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, et al. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

2023

[10] [10]

Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024

[11] [11]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

2023

[14] [14]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837, 2022. 11 LLM Failure Taxonomy on Competitive ProgrammingPREPRINT Appendices A Prompt Templa...

2022