arxiv: 2409.12186 · v3 · submitted 2024-09-18 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Qwen2.5-Coder Technical Report

Binyuan Hui , Jian Yang , Zeyu Cui , Jiaxi Yang , Dayiheng Liu , Lei Zhang , Tianyu Liu , Jiajun Zhang

show 16 more authors

Bowen Yu Keming Lu Kai Dang Yang Fan Yichang Zhang An Yang Rui Men Fei Huang Bo Zheng Yibo Miao Shanghaoran Quan Yunlong Feng Xingzhang Ren Xuancheng Ren Jingren Zhou Junyang Lin

Authors on Pith no claims yet

Pith reviewed 2026-05-10 12:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords Qwen2.5-Codercode generationlarge language modelspretrainingsynthetic datacode benchmarksmodel evaluationcode repair

0 comments

The pith

Qwen2.5-Coder models reach state-of-the-art code performance across sizes by continued pretraining on over 5.5 trillion tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Qwen2.5-Coder series of models sized from 0.5 billion to 32 billion parameters as an upgrade over earlier code-focused versions. It builds on the Qwen2.5 base through continued pretraining on a large code corpus combined with data cleaning, synthetic data creation, and balanced mixing of sources. This process produces strong results on code generation, completion, reasoning, and repair tasks. The models often beat larger models of similar scale while keeping general knowledge and math abilities intact. The work matters because it points to practical ways to build capable coding tools that developers can run and adapt without needing the biggest possible systems.

Core claim

The Qwen2.5-Coder series, built on the Qwen2.5 architecture and continued pretrained on over 5.5 trillion tokens through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, achieves state-of-the-art performance across more than 10 benchmarks for code generation, completion, reasoning, and repair while retaining general and math skills and consistently outperforming larger models of the same size.

What carries the argument

Continued pretraining on a vast code corpus of over 5.5 trillion tokens using data cleaning, synthetic data generation, and balanced mixing on the Qwen2.5 architecture.

If this is right

Code generation and repair tasks become solvable at high quality with models that fit on modest hardware.
Specialized training can produce code skills that exceed what raw size alone delivers in competing models.
General and math performance stays available, so the models function as versatile assistants rather than narrow tools.
Permissive licensing allows direct integration into developer workflows and further research without restrictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data preparation steps could transfer to other narrow domains if comparable volumes of clean and synthetic data exist.
Smaller models in the series open the door to on-device code completion and debugging features in everyday software.
Combining these models with existing general-purpose systems might create hybrid setups that handle mixed coding and non-coding queries efficiently.

Load-bearing premise

The chosen benchmarks and evaluation conditions provide a fair, unbiased measure of real code capabilities that allows direct comparison to other models.

What would settle it

An independent test on a fresh collection of real developer code problems from open repositories where the Qwen2.5-Coder models fail to match or exceed the performance of larger models of the same size.

read the original abstract

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Qwen2.5-Coder is a scaled-up version of their prior code model with solid data work but the SOTA claims need tighter evaluation details to land cleanly.

read the letter

The main thing here is that the Qwen team took their Qwen2.5 base, kept pretraining it on more than 5.5 trillion code tokens, and produced a family of models from 0.5B to 32B that they say outperform larger models on code tasks while holding onto general and math ability. The report walks through the practical steps they took: data cleaning, synthetic data generation, and careful mixing to avoid catastrophic forgetting. That part is useful because it shows what a real production-scale run looks like when you start from a strong base and focus on code without wrecking everything else. Releasing the models openly with a permissive license also gives the community something concrete to test and build on, which is more valuable than another closed model announcement. The training recipe itself is straightforward and well-motivated, and the fact that they kept six different sizes lets readers see scaling behavior directly. The soft spots sit mostly in the evaluation. The paper claims state-of-the-art results across more than ten benchmarks for generation, completion, reasoning, and repair, but it does not lay out the exact prompting formats, decoding parameters, few-shot counts, or benchmark versions used for the comparisons. With 5.5 trillion tokens of pretraining, the risk of test-set contamination is real, and the report does not include explicit decontamination checks or ablations that would rule it out. The stress-test note is fair on this point; without those controls the outperformance numbers are harder to trust at face value. This is the kind of paper that practitioners and applied researchers working on code intelligence will actually read and use. It does not introduce new methods, but it documents a successful iteration that others can replicate or extend. I would send it to peer review because the models are released and the claims are at least partially falsifiable by running the open weights, even if the write-up would need a stronger methods and evaluation section to hold up under scrutiny.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Qwen2.5-Coder series of six code-specialized models (0.5B to 32B parameters) built on the Qwen2.5 architecture. These undergo continued pretraining on a 5.5-trillion-token code corpus using data cleaning, scalable synthetic data generation, and balanced mixing. The report claims the resulting models achieve state-of-the-art performance on more than 10 benchmarks spanning code generation, completion, reasoning, and repair, while retaining general and math capabilities, and consistently outperform larger models of equivalent size.

Significance. If the performance claims are substantiated with reproducible details, the work would be significant for releasing a family of strong, permissively licensed code models at multiple scales. The scale of the continued pretraining corpus and the explicit effort to preserve non-code skills via balanced mixing represent a practical contribution to specialized LLM development that could support both research and developer adoption.

major comments (3)

[Abstract] Abstract: The central claim of 'state-of-the-art (SOTA) performance across more than 10 benchmarks' and 'consistently outperforming larger models of the same model size' supplies no benchmark names, baseline models, evaluation methodology (prompting format, few-shot count, decoding parameters, temperature/top-p), error bars, or statistical tests. This absence prevents verification of whether the data support the outperformance assertion.
[Pretraining description] Pretraining description: Continued pretraining on >5.5 trillion tokens creates a material risk of test-set contamination for the cited code benchmarks. The manuscript provides no description of decontamination procedures, overlap checks, or synthetic-data filtering steps that would be required to support the integrity of the SOTA results.
[Evaluation section] Evaluation section: No information is given on whether all compared models (including larger baselines) were evaluated under identical conditions, benchmark versions, or prompting setups. Any deviation would undermine the cross-model size comparison that is load-bearing for the main claim.

minor comments (2)

[Abstract] The model-size notation 'Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B)' is compact but could be expanded into a clearer bulleted list for readability.
[Abstract] The phrase 'impressive code generation capabilities' is subjective; replacing it with a brief quantitative reference to the claimed benchmark gains would improve precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from greater specificity in the abstract, pretraining description, and evaluation section to improve verifiability and address potential concerns about contamination and fair comparison. We will incorporate revisions to resolve these issues.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of 'state-of-the-art (SOTA) performance across more than 10 benchmarks' and 'consistently outperforming larger models of the same model size' supplies no benchmark names, baseline models, evaluation methodology (prompting format, few-shot count, decoding parameters, temperature/top-p), error bars, or statistical tests. This absence prevents verification of whether the data support the outperformance assertion.

Authors: We agree that the abstract would be strengthened by naming the primary benchmarks and baselines and by briefly indicating the evaluation protocol. In the revised manuscript we will expand the abstract to list the key benchmarks (HumanEval, MBPP, LiveCodeBench, BigCodeBench, etc.), the main comparison models, and a concise statement of the shared prompting and decoding settings. Full tables with per-benchmark scores, error bars, and statistical comparisons will remain in the Evaluation section, but the abstract will now reference them explicitly. revision: yes
Referee: [Pretraining description] Pretraining description: Continued pretraining on >5.5 trillion tokens creates a material risk of test-set contamination for the cited code benchmarks. The manuscript provides no description of decontamination procedures, overlap checks, or synthetic-data filtering steps that would be required to support the integrity of the SOTA results.

Authors: This is a legitimate concern. The current manuscript does not describe decontamination steps. We will add a new subsection under Data Preparation that details (1) n-gram and embedding-based overlap checks performed against the public versions of the evaluation benchmarks, (2) removal of any detected contaminated samples from the 5.5-trillion-token corpus, and (3) the filtering rules applied during synthetic data generation to prevent benchmark leakage. These procedures were followed during training and will now be documented. revision: yes
Referee: [Evaluation section] Evaluation section: No information is given on whether all compared models (including larger baselines) were evaluated under identical conditions, benchmark versions, or prompting setups. Any deviation would undermine the cross-model size comparison that is load-bearing for the main claim.

Authors: We confirm that every model—including the larger baselines—was run under a single, fixed evaluation harness using identical benchmark versions, prompt templates, few-shot counts, and decoding parameters (temperature 0.2, top-p 0.95, max tokens 512). The manuscript simply omits an explicit statement of this uniformity. In the revision we will insert a dedicated paragraph at the start of the Evaluation section that enumerates the common protocol, benchmark versions, and hyper-parameters so that the size-comparison claims rest on clearly documented identical conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SOTA claims rest on external benchmarks

full rationale

The paper reports continued pretraining of Qwen2.5-based models on a 5.5T-token code corpus, followed by data cleaning, synthetic data generation, and balanced mixing, then direct evaluation on public code benchmarks. No equations, fitted parameters, or derivations are present that could reduce to self-definition or self-citation. Performance claims compare against external models under stated conditions; the chain is self-contained against independent benchmarks and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work implicitly relies on standard transformer pretraining assumptions common to LLM literature.

axioms (1)

domain assumption Transformer architecture is effective for modeling code sequences
Models are built upon the Qwen2.5 architecture

pith-pipeline@v0.9.0 · 5578 in / 1094 out tokens · 62237 ms · 2026-05-10T12:28:34.514129+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
cs.CR 2026-04 unverdicted novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

HLS-Seek replaces full-synthesis RL with a comparative proxy reward model plus uncertainty-triggered real checks, yielding higher correctness and better QoR than larger models at 8.5x lower training cost.
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
cs.AR 2026-05 unverdicted novelty 7.0

Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
cs.CL 2026-05 unverdicted novelty 7.0

A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
cs.CV 2026-05 unverdicted novelty 7.0

VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
cs.LG 2026-05 unverdicted novelty 7.0

DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
cs.CV 2026-05 unverdicted novelty 7.0

SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

PlantMarkerBench is a new multi-species benchmark with 5,550 evidence instances for evaluating language models on literature-grounded plant marker gene reasoning across expression, localization, function, indirect, an...
Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills
cs.CR 2026-05 unverdicted novelty 7.0

Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0

EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 7.0

BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative
cs.CL 2026-05 unverdicted novelty 7.0

Mean-pooled cosine similarity grows with sequence length in anisotropic transformer embeddings independent of content, while CKA shows far less length dependence across code, translation, and vision tasks.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
cs.LG 2026-05 unverdicted novelty 7.0

Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair
cs.SE 2026-05 unverdicted novelty 7.0

ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
cs.SE 2026-05 conditional novelty 7.0

LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall per...
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
cs.SE 2026-04 unverdicted novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
Using large language models for embodied planning introduces systematic safety risks
cs.AI 2026-04 unverdicted novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
cs.CL 2026-04 unverdicted novelty 7.0

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
Understanding Human Actions through the Lens of Executable Models
cs.AI 2026-04 unverdicted novelty 7.0

EXACT is a new DSL for human motions as executable reward-generating programs, enabling compositional neuro-symbolic models that improve data efficiency and capture intuitive action relationships over monolithic approaches.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Cognitive Policy-Driven LLM for Diagnosis and Intervention of Cognitive Distortions in Emotional Support Conversation
cs.CL 2026-04 unverdicted novelty 7.0

The CogBiasESC dataset and CoPoLLM framework enable LLMs to diagnose cognitive distortions and apply interventions in emotional support conversations, outperforming baselines on accuracy, effectiveness, and safety.
Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding
cs.CL 2026-04 unverdicted novelty 7.0

CognitiveBench reveals LLMs suffer representation overlap on joint cognitive tasks due to hierarchical structure; HyCoLLM in hyperbolic space fixes the mismatch and outperforms GPT-4o with far fewer parameters.
Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding
cs.CL 2026-04 unverdicted novelty 7.0

Schema-key wording functions as an implicit instruction channel under constrained decoding, with experiments showing that rephrasing only the keys can substantially change accuracy on math benchmarks while prompt, mod...
LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
cs.CR 2026-04 unverdicted novelty 7.0

Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.
Validity-Calibrated Reasoning Distillation
cs.LG 2026-04 unverdicted novelty 7.0

Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
Validity-Calibrated Reasoning Distillation
cs.LG 2026-04 unverdicted novelty 7.0

Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
cs.SE 2026-04 accept novelty 7.0

CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE 2026-04 unverdicted novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
cs.SE 2026-04 unverdicted novelty 7.0

LLM deobfuscation of binaries to pseudocode depends more on reasoning ability and task-specific fine-tuning than on model size, with reasoning models showing robustness across ISAs and obfuscation levels on the new Bi...
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI 2026-04 unverdicted novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
cs.SE 2026-04 unverdicted novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
An Iterative Test-and-Repair Framework for Competitive Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
Automating Database-Native Function Code Synthesis with LLMs
cs.DB 2026-04 conditional novelty 7.0

DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreS...
Think Anywhere in Code Generation
cs.SE 2026-03 unverdicted novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
cs.CL 2026-03 conditional novelty 7.0

TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
Understanding and Accelerating the Training of Masked Diffusion Language Models
cs.LG 2026-05 conditional novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems
stat.ML 2026-05 unverdicted novelty 6.0

A wrapper for black-box generate-verify AI pipelines that uses a conservative hard-negative reference pool and e-processes to control the probability of releasing on infeasible tasks while permitting release on feasible ones.
Scalable Token-Level Hallucination Detection in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
Uncertainty Quantification for LLM-based Code Generation
cs.SE 2026-05 unverdicted novelty 6.0

RisCoSet applies multiple hypothesis testing to construct risk-controlling partial-program prediction sets for LLM code generation, achieving up to 24.5% less code removal than prior methods at equivalent risk levels.
Step Rejection Fine-Tuning: A Practical Distillation Recipe
cs.LG 2026-05 unverdicted novelty 6.0

Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
cs.LG 2026-05 unverdicted novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
cs.AI 2026-05 unverdicted novelty 6.0

A critique-and-routing controller cast as a finite-horizon MDP with policy-gradient optimization outperforms one-shot routing baselines on reasoning benchmarks while using the strongest agent for under 25% of calls.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization
cs.CR 2026-05 unverdicted novelty 6.0

SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with z...
POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery task...
Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
cs.LG 2026-05 conditional novelty 6.0

Mage shows compile-pass rate is anti-correlated with functional correctness in LLM game scene generation; direct NL-to-C# yields 43% runtime but F1~0.12 structure, while IR conditioning recovers structure (F1 up to 1....

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 131 Pith papers · 21 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 ,

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Car- los Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988,

work page arXiv
[3]

Program Synthesis with Large Language Models

2024.06.21. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Efficient training of language models to fill in the middle

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255,

work page arXiv
[6]

Language Models are Few-Shot Learners

Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[7]

J., Feldman, M

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feld- man, et al. Multipl-e: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227,

work page arXiv
[8]

Mceval: Massively multilingual code evaluation.arXiv preprint arXiv:2406.07436, 2024

Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang, Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation. arXiv preprint arXiv:2406.07436,

work page arXiv
[9]

How to prompt LLMs for text-to-SQL: A study in zero-shot, single- domain, and cross-domain settings

Shuaichen Chang and Eric Fosler-Lussier. How to prompt llms for text-to-sql: A study in zero-shot, single-domain, and cross-domain settings. arXiv preprint arXiv:2305.11853,

work page arXiv
[10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu- ating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7889–7901,

work page 2023
[12]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132,

work page internal anchor Pith review arXiv
[13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Codebert: A pre-trained model for programming and natural languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, v...

work page 2020
[17]

findings-acl.262/

doi: 10.18653/V1/2020.FINDINGS-EMNLP .139. URL https://doi.org/10.18653/v1/2020.findings-emnlp.139. Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? arXiv preprint arXiv:2406.04127,

work page doi:10.18653/v1/2020.findings-emnlp 2020
[18]

Evaluation of llms on syntax-aware code fill-in-the-middle tasks

Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. Evaluation of llms on syntax-aware code fill-in-the-middle tasks. arXiv preprint arXiv:2403.04814,

work page arXiv
[19]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Alex Gu, Baptiste Rozi`ere, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065,

work page internal anchor Pith review arXiv
[20]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024a. Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, ...

work page internal anchor Pith review arXiv 2009
[21]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review arXiv
[22]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contami- nation free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review arXiv
[23]

Mistral 7B

AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b (2023). arXiv preprint arXiv:2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

StarCoder: may the source be with you!

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36, 2024a. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis ...

work page internal anchor Pith review arXiv
[25]

AutoKaggle : A Multi - Agent Framework for Autonomous Data Science Competitions

30 Technical Report Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tianyu Zheng, Xinyao Niu, Xiang Yue, Yue Wang, Jian Yang, Jiaheng Liu, et al. Autokaggle: A multi-agent framework for autonomous data science competitions. arXiv preprint arXiv:2410.20424, 2024b. Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human fals...

work page arXiv
[26]

Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

J Liu, CS Xia, Y Wang, and L Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arxiv preprint arxiv: 230501210. 2023,

work page 2023
[27]

M2rc-eval: Massively multilingual repository-level code completion evaluation

Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, et al. M2rc-eval: Massively multilingual repository-level code completion evaluation. arXiv preprint arXiv:2410.21157, 2024a. Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, et al. ...

work page arXiv
[28]

Reacc: A retrieval-augmented code completion framework

Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svy- atkovskiy. Reacc: A retrieval-augmented code completion framework. arXiv preprint arXiv:2203.07722,

work page arXiv
[29]

2024.05.29. OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o,

work page 2024
[30]

YaRN: Efficient Context Window Extension of Large Language Models

2024.05.13. Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,

work page internal anchor Pith review arXiv 2024
[31]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

URL https://qwenlm.github.io/blog/ codeqwen1.5/. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290,

work page internal anchor Pith review arXiv
[32]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

work page internal anchor Pith review arXiv
[33]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641,

work page internal anchor Pith review arXiv 1907
[34]

Unicoder: Scaling code large language model via universal code

Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun Yang, and Zhoujun Li. Unicoder: Scaling code large language model via universal code. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, ...

work page 2024
[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models

URL https://aclanthology.org/2024.acl-long.100. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Magicoder: Em- powering code generation with oss-instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Em- powering code generation with oss-instruct. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 . OpenReview.net,

work page 2024
[37]

31 Technical Report Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma

URL https://openreview.net/forum?id=XUeoOBid3x. 31 Technical Report Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. Repoformer: Selective retrieval for repository-level code completion. arXiv preprint arXiv:2403.10059, 2024a. Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xia...

work page arXiv
[38]

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task.arXiv preprint arXiv:1809.08887,

work page Pith review arXiv
[39]

Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning

Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. Wavecoder: Widespread and versatile enhancement for code large language models by instruction tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume...

work page 2024
[40]

Rethinking tabular data understanding with large language models

doi: 10.18653/V1/2024. ACL-LONG.280. URL https://doi.org/10.18653/v1/2024.acl-long.280. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

work page doi:10.18653/v1/2024 2024
[41]

Zhang, B

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian- Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570,

work page arXiv
[42]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877,

work page internal anchor Pith review arXiv