BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Alex Gu; Armel Randy Zebaze; Binyuan Hui; Chen Gong; Daniel Fried; David Lo; Han Hu; Haolan Zhan; Harm de Vries; Imam Nur Bani Yusuf

arXiv: 2406.15877 · v4 · pith:WIBKBDV4new · submitted 2024-06-22 · 💻 cs.SE · cs.AI · cs.CL

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL

keywords: code generation · large language models · benchmarks · function calls · instruction following · software engineering · tool usage · practical coding tasks

The pith

Large language models reach only up to 60 percent success on tasks requiring precise use of diverse function calls from many libraries, far below the 97 percent human level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces BigCodeBench to test how well large language models generate code for practical tasks that demand multiple function calls drawn from 139 different libraries across seven domains. The benchmark contains 1,140 tasks, each checked by an average of 5.6 test cases that achieve 99 percent branch coverage. A parallel version called BigCodeBench-Instruct converts the original detailed docstrings into short natural-language instructions. Evaluation of 60 models shows they struggle to follow the complex instructions needed for accurate tool use. A sympathetic reader would care because everyday programming relies on composing calls from varied sources rather than solving self-contained problems.

Core claim

BigCodeBench challenges large language models to solve 1,140 fine-grained tasks by invoking multiple function calls as tools from 139 libraries across seven domains. Each task supplies an average of 5.6 test cases with 99 percent branch coverage. The authors also release BigCodeBench-Instruct, which automatically converts docstrings into short instructions containing only essential information. Testing sixty models establishes that current systems are not yet capable of following complex instructions to use function calls precisely.
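
To make the task format concrete, here is a minimal sketch of what a task of this kind might look like: a function stub with a detailed docstring, a solution that composes calls from more than one library, and a small test class that exercises each branch. This example is invented for illustration and is not taken from the benchmark; the name `task_func`, the docstring layout, and the choice of `re` and `collections` are all assumptions.

```python
import re
import unittest
from collections import Counter


def task_func(text, top_n=3):
    """
    Count word frequencies in `text` (case-insensitive) and return the
    `top_n` most common words with their counts.

    Parameters:
    - text (str): Input text.
    - top_n (int): Number of words to return. Must be positive.

    Returns:
    - list[tuple[str, int]]: (word, count) pairs, most frequent first.

    Raises:
    - ValueError: If `top_n` is not positive.

    Requirements:
    - re
    - collections
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    # Two library calls composed: re.findall tokenizes, Counter ranks.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


class TestCases(unittest.TestCase):
    """Hypothetical test cases; together they hit both branches of task_func."""

    def test_basic(self):
        self.assertEqual(task_func("a b a c a b", 2), [("a", 3), ("b", 2)])

    def test_case_insensitive(self):
        self.assertEqual(task_func("Dog dog DOG", 1), [("dog", 3)])

    def test_empty(self):
        self.assertEqual(task_func("", 2), [])

    def test_invalid_top_n(self):
        with self.assertRaises(ValueError):
            task_func("a", 0)


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCases)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A generated solution would count as solving the task only if every test in the class passes, which is why the tests cover the error branch as well as the normal path.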

What carries the argument

BigCodeBench benchmark of 1,140 tasks that require invoking multiple function calls from 139 libraries in seven domains, together with its BigCodeBench-Instruct variant that uses simplified natural-language instructions.
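
The docstring-to-instruction conversion can be pictured with a small rule-based sketch. This is not the authors' converter: the function `docstring_to_instruction`, the section names it recognizes, and the decision to keep only the summary, `Raises`, and `Returns` are assumptions made for illustration.

```python
import inspect


def docstring_to_instruction(docstring, keep_sections=("Raises", "Returns")):
    """Hypothetical condenser: keep the summary and selected sections only."""
    lines = inspect.cleandoc(docstring).splitlines()
    summary, sections, current = [], {}, None
    for line in lines:
        stripped = line.strip()
        # A line such as "Returns:" opens a new section.
        if stripped.endswith(":") and stripped[:-1].isalpha():
            current = stripped[:-1]
            sections[current] = []
        elif current is None:
            if stripped:
                summary.append(stripped)
        elif stripped:
            # Drop the leading list marker from section entries.
            sections[current].append(stripped.lstrip("- "))
    parts = [" ".join(summary)]
    for name in keep_sections:
        if sections.get(name):
            parts.append(f"{name}: " + "; ".join(sections[name]))
    return " ".join(parts)


# Invented docstring in the structured style assumed above.
DOC = """
Count word frequencies in the text and return the most common words.

Parameters:
- text (str): Input text.
- top_n (int): Number of words to return.

Returns:
- list: (word, count) pairs, most frequent first.

Raises:
- ValueError: If top_n is not positive.

Example:
>>> task_func("a b a", 1)
[('a', 2)]
"""

instruction = docstring_to_instruction(DOC)
print(instruction)
```

The point of the sketch is the information loss: parameter details and worked examples disappear, so a model given the short form has to infer more of the intended behavior.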

If this is right

Current models must improve compositional reasoning to combine multiple function calls accurately under complex instructions.
Many existing code-generation benchmarks underestimate the demands of practical tasks that draw from numerous libraries.
The large gap to human performance indicates a need for targeted advances in instruction following for code that uses external tools.
Future model development should prioritize precise and reliable use of diverse library functions in realistic settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the tasks mirror everyday coding demands, then closing the gap will likely require training methods that explicitly practice multi-library composition rather than relying on scale alone.
The benchmark could be expanded to include live execution environments to test whether models handle runtime errors and library version differences as well as static instruction following.
Providing few-shot examples of correct function-call sequences in prompts might narrow the performance difference enough to make the tasks solvable by current architectures.
The results suggest that improvements in general reasoning may not transfer directly to code tool use without additional fine-tuning on library-specific patterns.

Load-bearing premise

The 1,140 tasks and their test cases accurately represent the challenges of real-world practical coding that requires diverse function calls from many libraries.

What would settle it

A model that scores above 80 percent across the full set of 1,140 tasks on the provided test cases would show that large language models can follow such complex instructions for precise function use.
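
As a rough sketch of how such a threshold could be checked, the snippet below aggregates per-task outcomes into a benchmark score using the standard unbiased pass@k estimator from the code-generation literature. It assumes the reported percentages are pass rates of this kind; the `results` mapping and the task identifiers are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generations of which c pass, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k=1):
    """Average pass@k over tasks; results maps task id -> (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)


# Made-up outcomes: one greedy sample per task, 3 of 5 tasks solved.
results = {
    "task_0": (1, 1),
    "task_1": (1, 0),
    "task_2": (1, 1),
    "task_3": (1, 1),
    "task_4": (1, 0),
}

score = benchmark_score(results, k=1)
clears_threshold = score > 0.80  # the 80 percent bar named above
print(f"score={score:.2f} clears_threshold={clears_threshold}")
```

With a single sample per task, pass@1 reduces to the fraction of tasks whose tests all pass, so the toy data above gives 0.60 and does not clear the bar.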

Original abstract

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BigCodeBench gives a new multi-library code benchmark with 1,140 tasks and a 60% vs 97% gap, but the tasks' real complexity needs tighter validation.

The one thing to know is that BigCodeBench is a new benchmark for code generation that requires LLMs to make multiple function calls from a large set of libraries, and their tests on 60 models show performance maxing at 60% against 97% for humans.

The paper does a good job creating 1,140 fine-grained tasks across 7 domains and 139 libraries. Each task comes with an average of 5.6 test cases that achieve 99% branch coverage, which is a solid effort to make the evaluation rigorous. They also introduce BigCodeBench-Instruct by simplifying the docstrings to short instructions, allowing a direct test of instruction following without extra context. Running this on such a wide range of LLMs provides a useful snapshot of current capabilities in compositional code writing. Credit where due: the human baseline adds real perspective, and the scale of the evaluation is larger than most prior code gen benchmarks. The results do point to a limitation in how well models handle precise tool use under complex instructions.

On the other side, the central claim depends on the tasks genuinely needing diverse and compositional function calls in practical settings. The abstract describes the setup but leaves open questions about the minimum number of calls required per task or the exact criteria used to select and validate the tasks. If a portion of the tasks can be solved with shallow reasoning or fewer calls, the performance difference might not fully demonstrate the claimed challenge. More details on curation would help secure that part.

This paper is for researchers and practitioners working on improving LLMs for software engineering tasks, particularly those looking for benchmarks that move past isolated algorithmic problems. Anyone evaluating code models would find the numbers and the new test set relevant. It has enough new material and empirical grounding to deserve a serious referee, though the authors should be prepared to address questions on task representativeness. I'd recommend putting it through peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces BigCodeBench, a benchmark of 1,140 fine-grained Python code-generation tasks that require invoking multiple function calls drawn from 139 libraries across 7 domains. Each task supplies an average of 5.6 test cases achieving 99 % branch coverage. A natural-language variant (BigCodeBench-Instruct) is also presented that condenses docstrings into short instructions. Evaluation of 60 LLMs yields a maximum score of 60 %, compared with 97 % human performance, leading the authors to conclude that current models cannot yet follow complex instructions to use function calls precisely.

Significance. If the 1,140 tasks genuinely demand compositional, multi-library function use under realistic instruction complexity, the reported performance gap supplies a concrete, falsifiable signal that existing LLMs remain limited in practical tool-use reasoning. The benchmark’s scale, library coverage, and high-coverage test suites would then constitute a useful addition to the code-generation evaluation suite.

major comments (2)

[Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.
[Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.
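
The call-count statistics requested in the first major comment could be gathered with a short static analysis over reference solutions. The sketch below is one possible approach and not a description of the authors' pipeline: `call_stats` and the two toy solutions are hypothetical, and counting every call expression (including method calls on built-in types) is only a crude proxy for library function use. It relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast
from statistics import mean


def call_stats(source):
    """Return (imported top-level libraries, distinct call targets)."""
    libraries, calls = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            libraries.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libraries.add(node.module.split(".")[0])
        elif isinstance(node, ast.Call):
            # Textual form of the callee, e.g. "re.findall".
            calls.add(ast.unparse(node.func))
    return libraries, calls


# Toy reference solutions, invented for illustration.
solutions = {
    "task_a": """
import re
from collections import Counter

def task_func(text):
    words = re.findall("[a-z]+", text.lower())
    return Counter(words).most_common(3)
""",
    "task_b": """
import math

def task_func(x):
    return math.sqrt(x)
""",
}

per_task = {}
for task_id, source in solutions.items():
    libs, calls = call_stats(source)
    per_task[task_id] = (len(libs), len(calls))

call_counts = [n_calls for _, n_calls in per_task.values()]
multi_library = mean(n_libs > 1 for n_libs, _ in per_task.values())
print(f"min calls={min(call_counts)}, mean calls={mean(call_counts)}, "
      f"multi-library fraction={multi_library}")
```

Reporting the minimum, the mean, and the multi-library fraction over all tasks would show directly whether single-call tasks make up a meaningful share of the benchmark.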

minor comments (1)

[Abstract] Abstract, first sentence: the clause 'where the tasks ranging from …' is grammatically incomplete and should be rephrased for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and statistics where needed to strengthen the presentation of task complexity and evaluation methodology.

Referee: [Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.

Authors: We agree that quantitative details on function-call composition are needed to support the claims. In the revision we will add to §3 a table and accompanying text reporting the minimum, average, and distribution of distinct function calls per task, as well as the fraction of tasks that draw functions from multiple libraries. The curation criteria (selection of tasks requiring compositional use of APIs drawn from real-world scenarios) will also be stated explicitly with supporting counts from the construction pipeline. revision: yes
Referee: [Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.

Authors: We acknowledge the omission. The revised evaluation section will include the number of human participants, their selection criteria (experienced Python developers), time limits applied, and explicit confirmation that they received identical documentation and test harness access as the models. These clarifications will be added without altering the reported 97 % figure. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction

The paper introduces BigCodeBench as a new collection of 1,140 tasks drawn from 139 libraries and evaluates 60 LLMs plus humans directly on them. No derivation, equation, or prediction is claimed; performance numbers are obtained by executing the released tasks and test suites. The representativeness concern raised in the skeptic note is an external-validity question, not a reduction of any result to its own inputs by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz is load-bearing for the headline claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the curated tasks and test cases represent practical coding challenges; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The selected tasks and test cases represent real-world practical coding challenges involving diverse function calls.
Invoked when claiming the benchmark measures LLM capability on challenging and practical tasks.

pith-pipeline@v0.9.0 · 5705 in / 1245 out tokens · 30475 ms · 2026-05-14T19:24:25.315333+00:00 · methodology

Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 accept novelty 8.0

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
cs.AI 2026-05 unverdicted novelty 8.0

SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
cs.SE 2026-03 unverdicted novelty 8.0

SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
cs.AI 2026-05 conditional novelty 7.0

BOHM extracts multi-resolution attribution trees from existing routing weights in hierarchical AI systems, providing zero-cost explanations that correlate with SHAP when routing is near-optimal.
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
cs.SE 2026-05 unverdicted novelty 7.0

CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
cs.SE 2026-05 conditional novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
Skill Retrieval Augmentation for Agentic AI
cs.CL 2026-04 unverdicted novelty 7.0

Introduces the SRA paradigm and SRA-Bench benchmark showing retrieval-based skill augmentation improves agent performance but skill incorporation remains a bottleneck regardless of retrieval quality.
Skill Retrieval Augmentation for Agentic AI
cs.CL 2026-04 unverdicted novelty 7.0

Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
cs.DC 2026-04 unverdicted novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
cs.SE 2026-04 conditional novelty 7.0

Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
Neurosymbolic Repo-level Code Localization
cs.SE 2026-04 unverdicted novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
cs.SE 2026-04 unverdicted novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode
cs.SE 2026-04 unverdicted novelty 7.0

DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.
Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters
cs.LG 2026-02 unverdicted novelty 7.0

Variability modeling from software engineering enables systematic sampling, measurement, and prediction of LLM inference configurations for energy, latency, and accuracy trade-offs.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
cs.SE 2025-12 unverdicted novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
cs.SE 2025-12 unverdicted novelty 7.0

PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than p...
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
cs.SE 2025-09 unverdicted novelty 7.0

A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%,...
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
cs.SE 2025-08 accept novelty 7.0

The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
cs.SE 2025-02 unverdicted novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Design and Report Benchmarks for Knowledge Work
cs.AI 2026-05 unverdicted novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
Harnessing LLM Agents with Skill Programs
cs.AI 2026-05 conditional novelty 6.0

HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning be...
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
cs.CL 2026-05 unverdicted novelty 6.0

HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up...
Exploiting LLM Agent Supply Chains via Payload-less Skills
cs.CR 2026-05 conditional novelty 6.0

Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
cs.LG 2026-05 unverdicted novelty 6.0

Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
cs.SE 2026-04 conditional novelty 6.0

SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs
cs.SE 2026-04 unverdicted novelty 6.0

A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.
InCoder-32B-Thinking: Industrial Code World Model for Thinking
cs.AR 2026-04 unverdicted novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
cs.SE 2026-02 unverdicted novelty 6.0

ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
cs.DC 2026-02 unverdicted novelty 6.0

ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
cs.LG 2026-01 unverdicted novelty 6.0

TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
cs.LG 2025-12 conditional novelty 6.0

LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
TRINITY: An Evolved LLM Coordinator
cs.LG 2025-12 unverdicted novelty 6.0

A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
cs.SE 2025-08 accept novelty 6.0

A group of 22 researchers proposes seven study types and eight guidelines for empirical software engineering studies involving LLMs to enhance reproducibility and replicability.
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
cs.CL 2025-08 unverdicted novelty 6.0

Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL 2025-06 conditional novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
A Study of LLMs' Preferences for Libraries and Programming Languages
cs.SE 2025-03 unverdicted novelty 6.0

Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
cs.LG 2026-05 unverdicted novelty 5.0

Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
An End-to-End Framework for Building Large Language Models for Software Operations
cs.LG 2026-04 unverdicted novelty 5.0

OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.
MiMo-V2-Flash Technical Report
cs.CL 2026-01 unverdicted novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
cs.SE 2025-06 unverdicted novelty 5.0

AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
OpenCompass: A Universal Evaluation Platform for Large Language Models
cs.CL 2026-05 conditional novelty 4.0

OpenCompass is a modular, high-concurrency platform for unified LLM evaluation across knowledge, reasoning, code, and other domains with support for rule-based, LLM-as-judge, and cascaded evaluators.
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
cs.SE 2026-04 unverdicted novelty 4.0

Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
An End-to-End Framework for Building Large Language Models for Software Operations
cs.LG 2026-04 unverdicted novelty 4.0

OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.
Qwen2.5-Coder Technical Report
cs.CL 2024-09 unverdicted novelty 4.0

Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 45 Pith papers

[1]

Later, OpenCodeInterpreter (Zheng et al., 2024b) developed a multi-turn instruction dataset and achieved better coding performance

are instruction-tuned with synthetic data containing diverse instruction-code pairs. Later, OpenCodeInterpreter (Zheng et al., 2024b) developed a multi-turn instruction dataset and achieved better coding performance. More recently, there has been a growing interest in agentic programming (Yang et al.), developing prompting systems to enhance the capabilit...

work page 2021
[2]

How well do the models generalize to the unseen tools and tasks?

for task completion and reasoning provide the possibility towards artificial general intelligence. Our goal is to provide the community with the most open, reliable, and scalable evaluations to truly understand the fundamental capabilities of LLMs for programming, pinpointing the ways to unleash their power. G.1 L IMITATIONS Given the limited time and bud...

work page 2014
[3]

This means when you see the function stub and docstring, you should be able to implement ← - with exactly the same functionality with the given function body

refine the function including its docstrings in order to make the function more realistic and less ← - ambiguous. This means when you see the function stub and docstring, you should be able to implement ← - with exactly the same functionality with the given function body

work page
[4]

hello")

write blackbox unit tests to ensure the functional correctness of the given function. You should also ← - make the function easy to test. ### Step1:Check Library Imports #### Import Statement - Remove the library imports that are not used in the code. - Import libraries before the function declaration. #### Library Usage - Check if the usage of these libr...

work page 2025
[5]

This prevents the user from inferring the function’s purpose based on its name

`Function Name` has not been obfuscated: - The given function should have a generic name such as `f` to ensure anonymity. This prevents the user from inferring the function’s purpose based on its name. - Example: Before: `def calculate_average(nums):` After: `def f(nums):`

work page
[6]

""Calculates something

`Docstring` is unclear, ambiguous, impractical or not well aligned with `Solution`: - The function’s docstring should provide a clear and concise description of its purpose, expected inputs, outputs, and examples of usage. If the description is vague or doesn’t match the function’s behavior, it can lead to confusion. - Example: Before: `"""Calcula...

work page
[7]

- Example: Before: `import math` (but no usage of `math` in the function) After: Remove `import math` or ensure it’s used in the function

`Solution` does not use all imported libraries or APIs: - If libraries are imported but not used in the function, it indicates redundant code or a mismatch between the problem description and the solution. - Example: Before: `import math` (but no usage of `math` in the function) After: Remove `import math` or ensure it’s used in the function

work page
[8]

- Example: If using `sqrt` from `math` library in the function, ensure `from math import sqrt` is present at the beginning

`Solution` uses APIs that are not included in `Import Statement`: - All external libraries or functions used in the solution should be imported at the beginning of the script to ensure the code runs without errors. - Example: If using `sqrt` from `math` library in the function, ensure `from math import sqrt` is present at the beginning

work page
[9]

- Example: If the problem is to calculate the square root, the solution should leverage the `math.sqrt` function

`Solution` does not use any library APIs: - The problem should be designed in a way that requires the usage of library APIs to solve it, ensuring the challenge of integrating external tools. - Example: If the problem is to calculate the square root, the solution should leverage the `math.sqrt` function

work page
[10]

- Example: Before: `random.randint(1,10)` After: `random.seed(seed); random.randint(1,10)`

`Solution` uses APIs in `random`, but does not pass a random seed to `Function Parameters`: - When using random functionalities, for reproducibility, it’s good practice to allow the user to set a seed. - Example: Before: `random.randint(1,10)` After: `random.seed(seed); random.randint(1,10)`

work page
[11]

- Example: Before: `# TODO: Implement this` After: Actual implementation of the required logic

`Solution` contains dummy code: - Placeholder or dummy code should be replaced with actual implementation to ensure the function works as expected. - Example: Before: `# TODO: Implement this` After: Actual implementation of the required logic

work page
[12]

Published as a conference paper at ICLR 2025

Unused global constants before `Problem Function`: - Any constants or variables that are not used in the solution should be removed to clean up the code.

work page 2025
[13]

`TestCases` uses libraries or APIs that are not included in `Import Statement`: - Similar to the solution, all external libraries or functions used in the test cases should be imported

work page
[14]

`TestCases` contains test cases that do not work for `Solution`: - All test cases should be aligned with the function’s behavior to ensure they test the function correctly

work page
[15]

For example, when plotting data on a graph, you might get an `AxesSubplot` object in return

`TestCases` does not test all attributes of the returned object: - If the function returns an object with multiple attributes or methods, the test cases should validate all of them to ensure complete coverage. For example, when plotting data on a graph, you might get an `AxesSubplot` object in return. This object has various attributes, like the t...

work page
[16]

`TestCases` does not test the files that result in `Solution`: - If the function creates or modifies files, the test cases should validate these files to ensure the function works as expected

work page
[17]

`TestCases` is wrapped in `run_tests`: - The test cases and the function to run them should be separated for clarity

work page
[18]

Test cases in `TestCases` are duplicated or used to test the same behavior: - Redundant test cases should be removed to keep the test suite concise and focused

work page
[19]

\nLet’s think step by step

Test data used in `TestCases` is missing: - All required data for testing should be provided or generated to ensure the test cases can run without issues. K EVALUATION SETUP K.1 INFERENCE We perform all the model inference on A100 GPUs, except for the closed ones. For the closed models, we rely on their official APIs provided in the documents. K.2 E...

work page 2023
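
As a hedged illustration of the random-seed item in the extracted checklist above (entry [10]), the following minimal Python sketch shows one way a function might accept a seed as a parameter so that its output is reproducible. The function name `draw_numbers` and its parameters `count` and `seed` are hypothetical, chosen only for this example; they are not taken from the benchmark's tasks.

```python
import random


def draw_numbers(count, seed=None):
    """Return `count` pseudo-random integers between 1 and 10 inclusive.

    `draw_numbers`, `count`, and `seed` are hypothetical names used only
    for illustration; they do not come from the benchmark itself.
    """
    # Seeding the module-level generator mirrors the "After" form in the
    # checklist item: random.seed(seed); random.randint(1, 10).
    if seed is not None:
        random.seed(seed)
    return [random.randint(1, 10) for _ in range(count)]


# Two calls with the same seed should yield the same sequence.
first = draw_numbers(5, seed=42)
second = draw_numbers(5, seed=42)
```

With the seed exposed as a parameter, a test case can compare two calls made with the same seed, instead of depending on whatever state the global generator happens to be in.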
