pith. machine review for the scientific record.

arxiv: 2406.11931 · v1 · submitted 2024-06-17 · 💻 cs.SE · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 00:55 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords code language model · mixture of experts · open-source model · coding benchmarks · mathematical reasoning · programming languages · context length

The pith

An open-source code model matches or exceeds closed-source leaders on coding and math benchmarks after training on six trillion extra tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DeepSeek-Coder-V2 as an open-source Mixture-of-Experts model that closes the performance gap with proprietary systems on code-related work. It starts from an earlier checkpoint and adds six trillion tokens of continued pre-training focused on code and math. The result is stronger reasoning on coding tasks, support for 338 programming languages instead of 86, and context length raised to 128K tokens. The authors position this as evidence that open models can deliver competitive code intelligence without closed-source restrictions.

Core claim

DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 using an additional six trillion tokens. This step substantially boosts its coding and mathematical reasoning while preserving general language performance. The model expands language coverage from 86 to 338 programming languages and its context length from 16K to 128K tokens. On standard benchmarks it records higher scores than GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math evaluations.

What carries the argument

Continued pre-training on a large code and math corpus using a Mixture-of-Experts architecture that activates only a subset of parameters during inference.
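The sparse-activation idea behind this can be sketched in a few lines. This is a minimal top-k routing toy, not DeepSeek-V2's actual architecture: the expert shapes, gating function, and top-k value are illustrative placeholders.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through only its top-k experts.

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of callables, each a small feed-forward net.
    All names and shapes here are hypothetical, for illustration only.
    """
    logits = x @ gate_w                      # router score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts
    # Only top_k experts execute, so active parameters << total parameters.
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

The key property for inference cost is that the loop body runs `top_k` times regardless of how many experts exist in total.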

Load-bearing premise

The reported benchmark scores reflect genuine new capability rather than overlap between the training data and the test problems.

What would settle it

Running the model on a fresh set of coding problems created after the training data cutoff and comparing results against the published scores would test whether the gains hold on unseen tasks.
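Such a post-cutoff evaluation could be sketched as follows. The cutoff date, problem format, and checker interface are all hypothetical stand-ins; the paper does not state its exact data cutoff.

```python
from datetime import date

def post_cutoff_pass_rate(problems, solve, cutoff=date(2023, 11, 1)):
    """Score only problems authored after the (assumed) training-data cutoff.

    problems: iterable of (created: date, prompt, checker), where
              checker(output) -> bool decides correctness.
    solve:    callable mapping a prompt to the model's output.
    """
    fresh = [(p, c) for d, p, c in problems if d > cutoff]
    if not fresh:
        return None  # nothing post-cutoff to test on
    passed = sum(1 for p, c in fresh if c(solve(p)))
    return passed / len(fresh)
```

A large gap between this rate and the published scores would suggest contamination; a comparable rate would support genuine generalization.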

read the original abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents DeepSeek-Coder-V2, an open-source Mixture-of-Experts code language model obtained via continued pre-training of an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens. It claims substantial gains in coding and mathematical reasoning over the prior DeepSeek-Coder-33B, expansion from 86 to 338 programming languages and from 16K to 128K context length, maintenance of general-language performance, and superior results relative to closed-source models (GPT-4-Turbo, Claude 3 Opus, Gemini 1.5 Pro) on coding and math benchmarks.

Significance. If the benchmark superiority claims are substantiated by rigorous decontamination and statistical controls, the work would be significant: it would supply the first openly available model that matches or exceeds the leading closed-source systems on code intelligence tasks, thereby lowering barriers to reproducible research in software engineering and AI-assisted programming.

major comments (2)
  1. [Abstract] Abstract: the headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, rendering the central empirical result only moderately supported.
  2. [Abstract / Training section] The description of continued pre-training on 6 trillion tokens supplies no overlap statistics, membership-inference results, or ablation that removes any examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.
minor comments (2)
  1. [Abstract] The abstract states performance is 'comparable' on general language tasks but does not quantify the degradation or improvement relative to the base DeepSeek-V2 checkpoint.
  2. [Abstract] No table or figure is referenced that would allow direct comparison of the new model's scores against the closed-source baselines on each individual benchmark.
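On the protocol point: coding benchmarks such as HumanEval and MBPP conventionally report pass@k via the unbiased estimator of Chen et al. (arXiv:2107.03374), one of the details the referee asks to see stated explicitly. A minimal implementation of that standard formula:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., arXiv:2107.03374):
    the probability that at least one of k samples drawn without
    replacement from n generations (c of which pass the unit tests)
    is correct, i.e. 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Reporting n, the sampling temperature, and this estimator (rather than raw best-of-n) is what makes scores comparable across papers.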

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and training details. We address each major comment below and have made revisions to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of superiority over GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks is presented without any description of the evaluation protocol, the precise benchmark suite (HumanEval, MBPP, GSM8K, MATH, etc.), decontamination steps, or statistical significance tests, rendering the central empirical result only moderately supported.

    Authors: We agree the abstract is concise by design and omits protocol details. The full evaluation protocol, benchmark definitions (HumanEval, MBPP, GSM8K, MATH and others), decontamination steps, and statistical comparisons appear in Sections 4 and 5. We have revised the abstract to briefly name the primary coding and math benchmarks and to direct readers to the main text for the complete protocol and significance testing. revision: yes

  2. Referee: [Abstract / Training section] The description of continued pre-training on 6 trillion tokens supplies no overlap statistics, membership-inference results, or ablation that removes any examples overlapping the test prompts; in the absence of such checks the observed margins cannot be confidently attributed to generalization rather than leakage.

    Authors: We acknowledge that explicit decontamination evidence was not provided in the original training description. We have added a dedicated paragraph in the Training section that reports n-gram overlap statistics between the 6-trillion-token corpus and the test sets of the reported benchmarks, together with an ablation that measures performance after removing any overlapping examples. These additions support the claim that the observed gains reflect generalization rather than leakage. revision: yes
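The kind of n-gram overlap check described in this response can be sketched as follows. The 10-gram window and whitespace tokenization are common decontamination conventions, not the paper's stated settings.

```python
def ngrams(tokens, n=10):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(test_items, corpus_docs, n=10):
    """Flag test items that share any n-gram with the training corpus.

    A sketch under assumed conventions: whitespace tokenization and a
    10-gram match threshold. Real pipelines typically normalize case
    and punctuation first and stream the corpus instead of holding it
    in memory.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.split(), n)
    return [t for t in test_items if ngrams(t.split(), n) & corpus_grams]
```

The ablation then re-scores the model on the benchmark with the flagged items removed; if the margin over baselines survives, leakage is a less plausible explanation.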

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical model, DeepSeek-Coder-V2, trained by continued pre-training on 6 trillion tokens from DeepSeek-V2. Its claims of superior performance rest on direct evaluations against closed-source models on standard coding and math benchmarks. There are no mathematical derivations, self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations that reduce the results to the inputs by construction. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer and MoE assumptions plus the empirical premise that continued pre-training on code data improves domain performance without degrading general capabilities.

axioms (2)
  • standard math Transformer-based MoE architecture improves efficiency for large-scale language modeling
    Invoked implicitly as the base for DeepSeek-V2 continuation
  • domain assumption Additional domain-specific pre-training enhances coding and math reasoning while preserving general language performance
    Core premise of the continued pre-training step

pith-pipeline@v0.9.0 · 5649 in / 1301 out tokens · 76177 ms · 2026-05-16T00:55:13.404239+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    cs.AI 2024-08 unverdicted novelty 8.0

    The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  4. An Empirical Study of Speculative Decoding on Software Engineering Tasks

    cs.SE 2026-04 unverdicted novelty 7.0

    Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

  5. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  6. FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification

    cs.AR 2026-03 unverdicted novelty 7.0

    FVRuleLearner introduces an Operator Reasoning Tree to learn operator-specific rules that improve natural-language to SystemVerilog assertion generation, raising syntax correctness by 3.95% and functional correctness ...

  7. CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    cs.CL 2026-02 unverdicted novelty 7.0

    Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

  8. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    cs.SE 2025-02 unverdicted novelty 7.0

    SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

  9. Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

    cs.CL 2024-10 conditional novelty 7.0

    Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.

  10. Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    cs.LG 2024-08 conditional novelty 7.0

    Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.

  11. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    cs.CL 2024-06 unverdicted novelty 7.0

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...

  12. Adversarial SQL Injection Generation with LLM-Based Architectures

    cs.CR 2026-05 unverdicted novelty 6.0

    RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.

  13. PaT: Planning-after-Trial for Efficient Test-Time Code Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

  14. SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

  15. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  16. Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

    cs.CV 2026-02 unverdicted novelty 6.0

    Generative AI exhibits a paradox of simplicity where complex scene generation succeeds but deterministic tasks like pure color images fail, addressed via a new hierarchical obedience framework and Violin benchmark sho...

  17. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  18. Scaling Synthetic Data Creation with 1,000,000,000 Personas

    cs.CL 2024-06 unverdicted novelty 6.0

    A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.

  19. PubSwap: Public-Data Off-Policy Coordination for Federated RLVR

    cs.LG 2026-04 unverdicted novelty 5.0

    PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.

  20. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 20 Pith papers · 17 internal anchors

  1. [1]

    L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis, N. Muennighoff, M. Mishra, A. Gu, M. Dey, et al. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988, 2023.

  2. [2]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.

  3. [3]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  4. [5]

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  5. [6]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

  6. [7]

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. DeepSeek-Coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.

  7. [8]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

  8. [9]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

  9. [10]

    C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models

    Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.

  10. [11]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

  11. [12]

    FastText.zip: Compressing text classification models

    A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.

  12. [13]

    CMMLU: Measuring massive multitask language understanding in Chinese

    H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.

  13. [14]

    J. Liu, C. S. Xia, Y. Wang, and L. Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  14. [15]

    StarCoder 2 and The Stack v2: The Next Generation

    A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.

  15. [16]

    American invitational mathematics examination - aime

    MAA. American Invitational Mathematics Examination (AIME), 2024.

  16. [17]

    Netmind.AI

    Netmind.AI. Odyssey-Math. https://github.com/protagolabs/odyssey-math/tree/main, accessed 2024-05-29.

  17. [18]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.

  18. [19]

    M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

  19. [20]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

  20. [21]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  21. [22]

    Q. Shi, M. Tang, K. Narasimhan, and S. Yao. Can language models solve olympiad programming? arXiv preprint arXiv:2404.10952, 2024.

  22. [23]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

  23. [24]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  24. [25]

    L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020.

  25. [26]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.

  26. [27]

    AGIEval: A human-centric benchmark for evaluating foundation models

    W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023.
