Qwen2.5-Coder Technical Report
Pith reviewed 2026-05-10 12:28 UTC · model grok-4.3
The pith
Qwen2.5-Coder models reach state-of-the-art code performance across sizes by continued pretraining on over 5.5 trillion tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Qwen2.5-Coder series is built on the Qwen2.5 architecture and further pretrained on over 5.5 trillion tokens, prepared through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing. It achieves state-of-the-art performance across more than 10 benchmarks for code generation, completion, reasoning, and repair while retaining general and math skills and, in the abstract's wording, consistently outperforms "larger models of the same model size".
What carries the argument
Continued pretraining on a vast code corpus of over 5.5 trillion tokens using data cleaning, synthetic data generation, and balanced mixing on the Qwen2.5 architecture.
If this is right
- Code generation and repair tasks become solvable at high quality with models that fit on modest hardware.
- Specialized training can produce code skills that exceed what raw size alone delivers in competing models.
- General and math performance stays available, so the models function as versatile assistants rather than narrow tools.
- Permissive licensing allows direct integration into developer workflows and further research without restrictions.
Where Pith is reading between the lines
- The same data preparation steps could transfer to other narrow domains if comparable volumes of clean and synthetic data exist.
- Smaller models in the series open the door to on-device code completion and debugging features in everyday software.
- Combining these models with existing general-purpose systems might create hybrid setups that handle mixed coding and non-coding queries efficiently.
Load-bearing premise
The chosen benchmarks and evaluation conditions provide a fair, unbiased measure of real code capabilities that allows direct comparison to other models.
What would settle it
An independent test on a fresh collection of real developer code problems from open repositories in which the Qwen2.5-Coder models fail to match or exceed the performance of the larger baseline models.
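One way such a head-to-head test could be scored is a paired bootstrap over per-problem pass/fail outcomes, which also yields the error bars the referee report asks for. The sketch below is illustrative only: the function name `paired_bootstrap_diff`, the 0/1 outcome encoding, and the 95% interval are assumptions, not anything described in the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(
    passes_a: List[int],
    passes_b: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Estimate the pass-rate difference (A minus B) with a 95% bootstrap interval.

    passes_a[i] and passes_b[i] are 1 if the model solved problem i, else 0.
    Both lists must describe the same problems in the same order.
    """
    if not passes_a or len(passes_a) != len(passes_b):
        raise ValueError("both models need results on the same non-empty problem set")

    rng = random.Random(seed)
    n = len(passes_a)
    observed = (sum(passes_a) - sum(passes_b)) / n

    diffs = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(passes_a[i] - passes_b[i] for i in idx) / n)

    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the interval for "smaller model minus larger model" lies entirely below zero on a fresh problem set, that would count against the outperformance claim; an interval that straddles zero would leave the question open.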
Original abstract
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Qwen2.5-Coder series of six code-specialized models (0.5B to 32B parameters) built on the Qwen2.5 architecture. These undergo continued pretraining on a code corpus of over 5.5 trillion tokens using data cleaning, scalable synthetic data generation, and balanced mixing. The report claims the resulting models achieve state-of-the-art performance on more than 10 benchmarks spanning code generation, completion, reasoning, and repair, while retaining general and math capabilities and, as the abstract puts it, consistently outperform "larger models of the same model size".
Significance. If the performance claims are substantiated with reproducible details, the work would be significant for releasing a family of strong, permissively licensed code models at multiple scales. The scale of the continued pretraining corpus and the explicit effort to preserve non-code skills via balanced mixing represent a practical contribution to specialized LLM development that could support both research and developer adoption.
major comments (3)
- [Abstract] The central claim of 'state-of-the-art (SOTA) performance across more than 10 benchmarks' and 'consistently outperforming larger models of the same model size' supplies no benchmark names, baseline models, evaluation methodology (prompting format, few-shot count, decoding parameters, temperature/top-p), error bars, or statistical tests. This absence prevents verification of whether the data support the outperformance assertion.
- [Pretraining description] Continued pretraining on >5.5 trillion tokens creates a material risk of test-set contamination for the cited code benchmarks. The manuscript provides no description of decontamination procedures, overlap checks, or synthetic-data filtering steps that would be required to support the integrity of the SOTA results.
- [Evaluation section] No information is given on whether all compared models (including larger baselines) were evaluated under identical conditions, benchmark versions, or prompting setups. Any deviation would undermine the cross-model size comparison that is load-bearing for the main claim.
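The overlap check raised in the second major comment can be illustrated with a minimal n-gram filter. This is a sketch under stated assumptions, not the authors' procedure: the whitespace tokenization, the default window of 10 tokens, and the function names are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of whitespace-token n-grams in `text`.

    Texts shorter than n tokens yield an empty set and are never flagged.
    """
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark prompt or solution."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(sample: str, index: Set[NGram], n: int) -> bool:
    """Flag a training sample that shares at least one n-gram with the benchmarks."""
    return not ngrams(sample, n).isdisjoint(index)


def decontaminate(corpus: Iterable[str], benchmark_texts: Iterable[str], n: int = 10) -> List[str]:
    """Return only the corpus samples with no n-gram overlap against the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [sample for sample in corpus if not is_contaminated(sample, index, n)]
```

Exact n-gram matching misses paraphrased or reformatted leaks, which is why a report would also need to state the window size, the normalization applied, and whether synthetic data passed through the same filter.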
minor comments (2)
- [Abstract] The model-size notation 'Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B)' is compact but could be expanded into a clearer bulleted list for readability.
- [Abstract] The phrase 'impressive code generation capabilities' is subjective; replacing it with a brief quantitative reference to the claimed benchmark gains would improve precision.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from greater specificity in the abstract, pretraining description, and evaluation section to improve verifiability and address potential concerns about contamination and fair comparison. We will incorporate revisions to resolve these issues.
Point-by-point responses
-
Referee: [Abstract] The central claim of 'state-of-the-art (SOTA) performance across more than 10 benchmarks' and 'consistently outperforming larger models of the same model size' supplies no benchmark names, baseline models, evaluation methodology (prompting format, few-shot count, decoding parameters, temperature/top-p), error bars, or statistical tests. This absence prevents verification of whether the data support the outperformance assertion.
Authors: We agree that the abstract would be strengthened by naming the primary benchmarks and baselines and by briefly indicating the evaluation protocol. In the revised manuscript we will expand the abstract to list the key benchmarks (HumanEval, MBPP, LiveCodeBench, BigCodeBench, etc.), the main comparison models, and a concise statement of the shared prompting and decoding settings. Full tables with per-benchmark scores, error bars, and statistical comparisons will remain in the Evaluation section, but the abstract will now reference them explicitly.
Revision: yes
-
Referee: [Pretraining description] Continued pretraining on >5.5 trillion tokens creates a material risk of test-set contamination for the cited code benchmarks. The manuscript provides no description of decontamination procedures, overlap checks, or synthetic-data filtering steps that would be required to support the integrity of the SOTA results.
Authors: This is a legitimate concern. The current manuscript does not describe decontamination steps. We will add a new subsection under Data Preparation that details (1) n-gram and embedding-based overlap checks performed against the public versions of the evaluation benchmarks, (2) removal of any detected contaminated samples from the 5.5-trillion-token corpus, and (3) the filtering rules applied during synthetic data generation to prevent benchmark leakage. These procedures were followed during training and will now be documented.
Revision: yes
-
Referee: [Evaluation section] No information is given on whether all compared models (including larger baselines) were evaluated under identical conditions, benchmark versions, or prompting setups. Any deviation would undermine the cross-model size comparison that is load-bearing for the main claim.
Authors: We confirm that every model, including the larger baselines, was run under a single, fixed evaluation harness using identical benchmark versions, prompt templates, few-shot counts, and decoding parameters (temperature 0.2, top-p 0.95, max tokens 512). The manuscript simply omits an explicit statement of this uniformity. In the revision we will insert a dedicated paragraph at the start of the Evaluation section that enumerates the common protocol, benchmark versions, and hyper-parameters so that the size-comparison claims rest on clearly documented identical conditions.
Revision: yes
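A uniform protocol of the kind described in the last response can be made checkable by recording the decoding settings per model and comparing them against one reference. The sketch below uses the values quoted in the rebuttal (temperature 0.2, top-p 0.95, max tokens 512); the `DecodingConfig` class, the `check_uniform_protocol` helper, and the model names in the usage note are hypothetical and are not part of any actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DecodingConfig:
    """Decoding settings shared by every evaluated model (values from the rebuttal)."""
    temperature: float = 0.2
    top_p: float = 0.95
    max_tokens: int = 512


def check_uniform_protocol(runs: Dict[str, DecodingConfig]) -> List[str]:
    """Return the sorted names of models whose settings differ from the reference."""
    reference = DecodingConfig()
    return sorted(name for name, cfg in runs.items() if cfg != reference)
```

For example, `check_uniform_protocol({"model-a": DecodingConfig(), "model-b": DecodingConfig(temperature=0.8)})` returns `["model-b"]`, flagging the run that would invalidate a cross-model comparison. A fuller version would also record benchmark versions, prompt templates, and few-shot counts.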
Circularity Check
No circularity: empirical SOTA claims rest on external benchmarks
full rationale
The paper reports continued pretraining of Qwen2.5-based models on a code corpus of over 5.5 trillion tokens, prepared through data cleaning, synthetic data generation, and balanced mixing, followed by direct evaluation on public code benchmarks. No equations, fitted parameters, or derivations are present that could reduce to self-definition or self-citation. Performance claims compare against external models under stated conditions; the chain is self-contained against independent benchmarks and does not invoke any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Transformer architecture is effective for modeling code sequences.
Forward citations
Cited by 60 Pith papers
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning
HLS-Seek replaces full-synthesis RL with a comparative proxy reward model plus uncertainty-triggered real checks, yielding higher correctness and better QoR than larger models at 8.5x lower training cost.
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
-
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
-
Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills
Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
-
Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative
Mean-pooled cosine similarity grows with sequence length in anisotropic transformer embeddings independent of content, while CKA shows far less length dependence across code, translation, and vision tasks.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
-
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair
ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
-
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall per...
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
-
Understanding Human Actions through the Lens of Executable Models
EXACT is a new DSL for human motions as executable reward-generating programs, enabling compositional neuro-symbolic models that improve data efficiency and capture intuitive action relationships over monolithic approaches.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation
The CogBiasESC dataset and CoPoLLM framework enable LLMs to diagnose cognitive distortions and apply interventions in emotional support conversations, outperforming baselines on accuracy, effectiveness, and safety.
-
Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding
CognitiveBench reveals LLMs suffer representation overlap on joint cognitive tasks due to hierarchical structure; HyCoLLM in hyperbolic space fixes the mismatch and outperforms GPT-4o with far fewer parameters.
-
Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding
Schema-key wording functions as an implicit instruction channel under constrained decoding, with experiments showing that rephrasing only the keys can substantially change accuracy on math benchmarks while prompt, mod...
-
LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
-
An Iterative Test-and-Repair Framework for Competitive Code Generation
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
Automating Database-Native Function Code Synthesis with LLMs
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreS...
-
Think Anywhere in Code Generation
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
-
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
-
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.
-
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
-
Uncertainty Quantification for LLM-based Code Generation
RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
-
Step Rejection Fine-Tuning: A Practical Distillation Recipe
Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
-
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization
SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with z...
-
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery task...
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 ,
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Car- los Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988,
-
[3]
Program Synthesis with Large Language Models
2024.06.21. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Efficient training of language models to fill in the middle
Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255,
-
[6]
Language Models are Few-Shot Learners
Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[7]
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feld- man, et al. Multipl-e: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227,
-
[8]
Mceval: Massively multilingual code evaluation.arXiv preprint arXiv:2406.07436, 2024
Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436,
-
[9]
How to prompt LLMs for text-to-SQL: A study in zero-shot, single- domain, and cross-domain settings
Shuaichen Chang and Eric Fosler-Lussier. How to prompt llms for text-to-sql: A study in zero-shot, single-domain, and cross-domain settings. arXiv preprint arXiv:2305.11853,
-
[10]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code. arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Theoremqa: A theorem-driven question answering dataset
Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7889–7901,
work page 2023
-
[12]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132,
work page internal anchor Pith review arXiv
-
[13]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
-
[16]
Codebert: A pre-trained model for programming and natural languages
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, v...
-
[17]
doi: 10.18653/V1/2020.FINDINGS-EMNLP.139. URL https://doi.org/10.18653/v1/2020.findings-emnlp.139. Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127,
-
[18]
Evaluation of llms on syntax-aware code fill-in-the-middle tasks
Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. Evaluation of llms on syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814,
-
[19]
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065,
-
[20]
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024a. Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, ...
-
[21]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
-
[22]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,
-
[23]
AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b (2023). arXiv preprint arXiv:2310.06825,
-
[24]
StarCoder: may the source be with you!
Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36, 2024a. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis ...
-
[25]
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tianyu Zheng, Xinyao Niu, Xiang Yue, Yue Wang, Jian Yang, Jiaheng Liu, et al. Autokaggle: A multi-agent framework for autonomous data science competitions. arXiv preprint arXiv:2410.20424, 2024b. Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human fals...
-
[26]
J Liu, CS Xia, Y Wang, and L Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210, 2023.
-
[27]
M2rc-eval: Massively multilingual repository-level code completion evaluation
Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, et al. M2rc-eval: Massively multilingual repository-level code completion evaluation. arXiv preprint arXiv:2410.21157, 2024a. Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, et al. ...
-
[28]
Reacc: A retrieval-augmented code completion framework
Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. Reacc: A retrieval-augmented code completion framework. arXiv preprint arXiv:2203.07722,
-
[29]
2024.05.29. OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o,
-
[30]
YaRN: Efficient Context Window Extension of Large Language Models
2024.05.13. Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,
-
[31]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
URL https://qwenlm.github.io/blog/codeqwen1.5/. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290,
-
[32]
Code Llama: Open Foundation Models for Code
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,
-
[33]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641,
-
[34]
Unicoder: Scaling code large language model via universal code
Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun Yang, and Zhoujun Li. Unicoder: Scaling code large language model via universal code. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, ...
-
[35]
Llama 2: Open Foundation and Fine-Tuned Chat Models
URL https://aclanthology.org/2024.acl-long.100. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,
-
[36]
Magicoder: Empowering code generation with oss-instruct
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net,
-
[37]
URL https://openreview.net/forum?id=XUeoOBid3x. Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. Repoformer: Selective retrieval for repository-level code completion. arXiv preprint arXiv:2403.10059, 2024a. Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xia...
-
[38]
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887,
-
[39]
Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning
Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume...
-
[40]
doi: 10.18653/V1/2024.ACL-LONG.280. URL https://doi.org/10.18653/v1/2024.acl-long.280. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,
- [41]
-
[42]
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877,