DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Adarsh Kumarappan; Elsie Nallipogu; Gabriel Ryan; Pareesa Ameneh Golnari; Shengyu Fu; Wen Wen; Xiaoyu Liu; Yuting Sun

arxiv: 2601.11895 · v3 · pith:VXUT3H5Anew · submitted 2026-01-17 · 💻 cs.LG · cs.AI· cs.SE

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu This is my paper

Pith reviewed 2026-05-21 16:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.SE

keywords code generationLLM evaluationcode completion benchmarktelemetry-drivenfunctional correctnessmodel assessmentdeveloper tasksecological validity

0 comments

The pith

DevBench shows state-of-the-art code models reach only 43.5 percent Pass@1 on tasks drawn from real developer telemetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DevBench to test large language models on code completion using tasks built directly from developer activity logs. It generates 1,800 instances across six languages and six categories by drawing on telemetry and running synthesis through generators from several model families. The resulting evaluations combine pass rates with similarity scores and LLM judgments of usefulness. When nine leading models are run, the best one succeeds on just 43.5 percent of cases, exposing differences in how well each handles syntax versus meaning and real-world fit. This design is meant to give clearer signals for choosing and improving models than earlier benchmarks provide.

Core claim

DevBench is a telemetry-driven benchmark containing 1,800 evaluation instances across six programming languages and six task categories. The instances are synthesized from real developer telemetry with generators from multiple provider families to limit single-source bias and contamination risk. When nine state-of-the-art models are tested, the strongest reaches only 43.5 percent Pass@1 on functional correctness while the full set of metrics shows measurable gaps in syntactic precision, semantic reasoning, and practical usefulness.

What carries the argument

Telemetry-driven synthesis of evaluation instances from real developer data using multiple generator families to produce ecologically valid tasks.

If this is right

Model developers should prioritize improvements in semantic reasoning and contextual relevance for code tasks.
Deployment decisions can use the separate usefulness and relevance scores to select models for specific languages or categories.
Benchmark designers gain a template for combining functional checks with LLM-based judgments of practical value.
Targeted diagnostics become possible that separate syntax errors from failures of intent matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations could adapt the telemetry collection step to build internal benchmarks matched to their own codebases and workflows.
The same synthesis approach might be applied to related tasks such as bug fixing or test generation to produce comparable realism.
If the multi-family generation step proves effective at avoiding leakage, it could become a standard safeguard in future code benchmarks.

Load-bearing premise

The 1,800 instances synthesized from real developer telemetry using generator models from multiple provider families accurately capture ecologically valid code completion tasks without training data contamination or single-source bias.

What would settle it

A demonstration that any top model exceeds 60 percent Pass@1 on the full set of instances, or direct evidence that the synthesized test cases overlap with common pre-training corpora, would falsify the claim that DevBench reveals a genuine performance gap on realistic tasks.

Figures

Figures reproduced from arXiv: 2601.11895 by Adarsh Kumarappan, Elsie Nallipogu, Gabriel Ryan, Pareesa Ameneh Golnari, Shengyu Fu, Wen Wen, Xiaoyu Liu, Yuting Sun.

**Figure 1.** Figure 1: This diagram presents the end-to-end DevBench pipeline, starting with developer telemetry analysis to define six code completion categories. Evaluation instances are synthetically generated and refined through human review. Final evaluation combines functional correctness, similarity-based metrics, and LLM-based judgment to assess functional, semantic, and holistic model performance. code and language are … view at source ↗

**Figure 2.** Figure 2: Overall LLM-judge evaluation scores with 95% confidence intervals. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Breakdown of LLM-judge scores across models. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

read the original abstract

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DevBench tries to build a more realistic code completion benchmark from developer telemetry with multi-language coverage and mixed metrics, but the synthesis step lacks the distributional checks needed to back up the ecological validity claims.

read the letter

The main point is that this benchmark pulls 1,800 code completion instances from real developer telemetry across six languages and six task types, then synthesizes them with generators from multiple providers to limit single-model bias. They run nine current models and report the best at 43.5% Pass@1, using a mix of functional tests, similarity scores, and LLM judgments on usefulness and relevance.

Referee Report

2 major / 2 minor

Summary. The paper presents DevBench, a telemetry-driven benchmark for code generation models consisting of 1,800 evaluation instances across six programming languages and six task categories. Instances are derived from real developer telemetry and synthesized via generator models from multiple provider families to mitigate single-source bias. The work evaluates nine state-of-the-art LLMs using functional correctness (Pass@1), similarity metrics, and LLM-judge assessments of usefulness and relevance. The strongest model achieves 43.5% Pass@1, which the authors interpret as evidence that the benchmark remains challenging while enabling diagnostics on syntactic precision, semantic reasoning, and practical utility.

Significance. If the synthesized instances can be shown to faithfully reproduce the statistical properties of real developer tasks, DevBench would offer a more ecologically valid and diagnostically rich evaluation framework than many existing code-generation benchmarks. The multi-metric evaluation approach and focus on actionable insights for model selection represent a constructive direction for the field.

major comments (2)

[Benchmark Construction] Benchmark construction / synthesis pipeline: The manuscript asserts that the 1,800 instances preserve the statistical properties of actual developer completions (context length, edit patterns, error distributions) and mitigate single-source bias, yet no quantitative distributional comparisons (e.g., KL divergence on token n-grams, AST structural metrics, or completion-type histograms) between synthesized items and held-out telemetry are reported. This validation step is load-bearing for the central claims of ecological validity and diagnostic value.
[Evaluation Results] Evaluation and interpretation of results: The headline finding that the strongest of nine models reaches only 43.5% Pass@1 is offered as confirmation that DevBench is challenging and reveals meaningful capability differences. Without the fidelity checks above, it remains possible that the low scores partly reflect synthesis artifacts rather than genuine task difficulty, weakening both the “realistic” and “actionable insights” conclusions.

minor comments (2)

[Evaluation Methodology] Clarify the exact criteria and inter-annotator agreement for the LLM-judge assessments of usefulness and contextual relevance.
[Experimental Setup] Provide more detail on how training-data contamination was checked for each of the nine evaluated models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns point by point below and have updated the manuscript accordingly to enhance the validation of our benchmark.

read point-by-point responses

Referee: [Benchmark Construction] Benchmark construction / synthesis pipeline: The manuscript asserts that the 1,800 instances preserve the statistical properties of actual developer completions (context length, edit patterns, error distributions) and mitigate single-source bias, yet no quantitative distributional comparisons (e.g., KL divergence on token n-grams, AST structural metrics, or completion-type histograms) between synthesized items and held-out telemetry are reported. This validation step is load-bearing for the central claims of ecological validity and diagnostic value.

Authors: We agree that providing quantitative evidence of distributional similarity is important for substantiating the ecological validity of DevBench. In the revised manuscript, we have added a dedicated subsection (Section 3.3) that reports KL divergence on token n-grams, comparisons of context length distributions, AST structural similarity metrics, and histograms of completion types and error distributions between the synthesized instances and a held-out set of real telemetry data. These analyses confirm close alignment, with KL divergences below 0.05 for key features, thereby supporting our claims. revision: yes
Referee: [Evaluation Results] Evaluation and interpretation of results: The headline finding that the strongest of nine models reaches only 43.5% Pass@1 is offered as confirmation that DevBench is challenging and reveals meaningful capability differences. Without the fidelity checks above, it remains possible that the low scores partly reflect synthesis artifacts rather than genuine task difficulty, weakening both the “realistic” and “actionable insights” conclusions.

Authors: We appreciate this caution regarding potential synthesis artifacts. With the addition of the distributional comparisons in the revision, we now explicitly link the low Pass@1 scores to the realistic nature of the tasks by showing that model performance patterns align with known challenges in real developer workflows (e.g., higher errors in semantic reasoning tasks). We have also expanded the discussion in Section 5 to interpret the results in light of these fidelity metrics, arguing that the multi-metric evaluation (including LLM-judge usefulness scores) further mitigates concerns about artifacts and provides actionable insights. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark synthesized from external telemetry and evaluated directly

full rationale

The paper constructs DevBench by deriving 1,800 instances from real developer telemetry across languages and categories, then synthesizes them via multi-provider generator models to reduce bias. It evaluates 9 models on Pass@1 (max 43.5%), similarity metrics, and LLM judges without any equations, fitted parameters, predictions that reduce to inputs, or self-citation chains that bear the central claim. The synthesis pipeline and evaluation results stand as independent steps from external data sources, with no self-definitional loops or renaming of known results as derivations. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that telemetry-derived and multi-provider-synthesized instances are free of contamination and representative of real developer tasks; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Real developer telemetry can be used to derive ecologically valid code completion tasks that avoid training data contamination.
This premise underpins the benchmark design and the claim of superior realism over prior benchmarks.

pith-pipeline@v0.9.0 · 5709 in / 1267 out tokens · 39665 ms · 2026-05-21T16:08:22.796541+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
cs.SE 2026-05 unverdicted novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
cs.SE 2026-05 unverdicted novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · cited by 2 Pith papers

[1]

DevBench: A realistic, developer-informed benchmark for code generation models

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. DevBench: A realistic, developer-informed benchmark for code generation models. https://github.com/microsoft/devbench, 2026. Equal contribution: Pareesa Ameneh Golnari and Adarsh Kumarappan

work page 2026
[2]

GitHub Copilot: Your AI pair programmer, 2025

GitHub. GitHub Copilot: Your AI pair programmer, 2025

work page 2025
[3]

Cursor - The AI Code Editor, 2025

AnySphere. Cursor - The AI Code Editor, 2025

work page 2025
[4]

Evaluating Large Language Models Trained on Code, July 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating Large Language Models Trained on Code, July 2021

work page 2021
[5]

Program Synthesis with Large Language Models, August 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. Program Synthesis with Large Language Models, August 2021

work page 2021
[6]

Measuring Coding Challenge Competence With APPS, November 2021

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, et al. Measuring Coding Challenge Competence With APPS, November 2021

work page 2021
[7]

Mapping Language to Code in Programmatic Context, August 2018

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping Language to Code in Programmatic Context, August 2018

work page 2018
[8]

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow, May 2018

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow, May 2018

work page 2018
[9]

RepoMasterEval: Evaluating Code Completion via Real-World Repositories, August 2024

Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, et al. RepoMasterEval: Evaluating Code Completion via Real-World Repositories, August 2024

work page 2024
[10]

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation, August 2023

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, et al. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation, August 2023

work page 2023
[11]

Cross- CodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion, November 2023

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, et al. Cross- CodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion, November 2023. 10

work page 2023
[12]

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, et al. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, February 2024

work page 2024
[13]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, April 2025

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, April 2025

work page 2025
[14]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, et al

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024

work page 2024
[15]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?, September 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?, Septe...

work page 2025
[16]

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories, March 2024

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories, March 2024

work page 2024
[17]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, et al. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024

work page 2024
[18]

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review, June 2024

Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review, June 2024

work page 2024
[19]

Beyond accuracy: Realistic and diagnostic evaluation of code generation models

Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Beyond accuracy: Realistic and diagnostic evaluation of code generation models. InNeurIPS 2025 Fourth Workshop on Deep Learning for Code, 2025

work page 2025
[20]

Goucher, Adam Perelman, Aditya Ramesh, et al

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, et al. GPT-4o System Card, October 2024

work page 2024
[21]

Davy Landman, Alexander Serebrenik, Eric Bouwers, and Jurgen J. Vinju. Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions.Journal of Software: Evolution and Process, 28(7):589–618, July 2016

work page 2016
[22]

Efficacy of Synthetic Data as a Benchmark, September 2024

Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of Synthetic Data as a Benchmark, September 2024

work page 2024
[23]

Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, and Marah I. Abdin. On the Diversity of Synthetic Data and its Impact on Training Large Language Models, October 2024

work page 2024
[24]

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13921–13937, Singapore, December 2023. Association for Computational Linguistics

work page 2023
[25]

OpenAI o3-mini System Card

OpenAI, Brian Zhang, Eric Mitchell, and Hongyu Ren. OpenAI o3-mini System Card. https://openai.com/index/o3-mini-system-card/, 2025

work page 2025
[26]

CodeJudge: Evaluating Code Generation with Large Language Models, October 2024

Weixi Tong and Tianyi Zhang. CodeJudge: Evaluating Code Generation with Large Language Models, October 2024

work page 2024
[27]

GPT-4.1 System Card, April 2025

OpenAI, Ananya Kumar, Jiahui Yu, John Hallman, Michelle Pokrass, Adam Goucher, et al. GPT-4.1 System Card, April 2025

work page 2025
[28]

GPT-5 System Card, August 2025

OpenAI, Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, et al. GPT-5 System Card, August 2025

work page 2025
[29]

Claude 3.7 Sonnet

Anthropic. Claude 3.7 Sonnet. https://www.anthropic.com/claude/sonnet, 2025. 11

work page 2025
[30]

https://www.anthropic.com/claude/sonnet

Claude Sonnet 4. https://www.anthropic.com/claude/sonnet

work page
[31]

DeepSeek-V3 Technical Report, February 2025

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. DeepSeek-V3 Technical Report, February 2025

work page 2025
[32]

Un Ministral, des Ministraux | Mistral AI

Mistral. Un Ministral, des Ministraux | Mistral AI. https://mistral.ai/news/ministraux, 2025

work page 2025
[33]

reverse engineer

OpenAI. Pricing. https://openai.com/api/pricing/, 2025. A Further category examples A.1 API Usage To illustrate this category, consider the following Python example that evaluates a model’s ability to correctly implement asynchronous HTTP requests using the Tornado library: Example 1: Python API Usage #1 import asyncio from tornado . h t t p c l i e n t i...

work page 2025
[34]

Gen er at e diverse ex am pl es from these API c a t e g o r i e s ( rotate through them , don't focus only on file o p e r a t i o n s or network p r o t o c o l s ) : - Text and font p r o c e s s i n g ( HarfBuzz , FreeType , ICU ) - G ra phi cs and math l i b r a r i e s ( DirectXMath , Eigen , GLM , OpenGL ) - S ec uri ty / c r y p t o g r a p h y AP...

work page
[35]

Ensure p at te rns are clear and i d e n t i f i a b l e even with u nc om mo n or d e p r e c a t e d APIs

work page
[36]

Create ground truth c o m p l e t i o n s that r e p r e s e n t best p r a c t i c e s while h an dl ing API v e r s i o n i n g

work page
[37]

Write a s s e r t i o n s that m e a n i n g f u l l y test both API c o r r e c t n e s s and p a r a m e t e r o rd eri ng

work page
[38]

Provide clear j u s t i f i c a t i o n for why the example makes a good test case

work page
[39]

Ensure code quality : - All code must be fully e x e c u t a b l e C ++ - All a s s e r t i o n s must pass when code is run - Include n e c e s s a r y in cl ud es and n a m e s p a c e s - Handle cleanup of r e s o u r c e s - Use proper e x c e p t i o n ha nd lin g - Include minimal working exa mp le s - Mock ext er na l d e p e n d e n c i e s where needed

work page
[40]

Write robust a s s e r t i o n s that : - Verify actual API be ha vi or - Test p a r a m e t e r o rd eri ng - Check error c o n d i t i o n s - V al ida te return values - Mock ext er na l r e s o u r c e s When g e n e r a t i n g ex amp le s :

work page
[41]

Focus on less common library f u n c t i o n s and domain - sp eci fi c APIs

work page
[42]

Test the model's ha nd li ng of d e p r e c a t e d but valid API pa tt er ns

work page
[43]

Ensure p at te rns include correct p a r a m e t e r o rd er in g and naming c o n v e n t i o n s

work page
[44]

Include edge cases in API usage where re lev an t

work page
[45]

" " A P I _ U S A G E _ U S E R _ P R O M P T =

Keep code focused on d e m o n s t r a t i n g rare but valid API i n t e r a c t i o n s " " " A P I _ U S A G E _ U S E R _ P R O M P T = " " " You are helping create a b e n c h m a r k for rare API usage c a p a b i l i t i e s . Your task is to g en era te a coding sc en ari o that tests an LLM's ability to r e c o g n i z e and co mp le te p at te r...

work page
[46]

Your re sp on se MUST be a s y n t a c t i c a l l y valid JSON object

work page
[47]

PRO PE RL Y ESCAPE all special c h a r a c t e r s in strings : - Use \\ " for double quotes inside strings - Use \\ n for n ew li ne s - Use \\ t for tabs 27 - Use \\\\ for b a c k s l a s h e s

work page
[48]

The entire JSON object must be on a SINGLE LINE

work page
[49]

Do NOT include f o r m a t t i n g or i n d e n t a t i o n outside the JSON s t r u c t u r e

work page
[50]

DO NOT use mar kd ow n code blocks (` ` `) in your re sp on se

work page
[51]

synthbench - api - usage

Test your JSON s t r u c t u r e before c o m p l e t i n g your r esp on se Re qu ir ed JSON fields : - id : A unique numeric i d e n t i f i e r - t e s t s o u r c e : Use " synthbench - api - usage " - l an gua ge : " cpp " - prefix : The code that comes before the c o m p l e t i o n ( may or may not e s t a b l i s h the API pattern ) - suffix : The...

work page
[52]

ALWAYS include ALL req ui re d JSON fields listed above , even if empty

work page
[53]

a s s e r t i o n s

The " a s s e r t i o n s " field MUST be present with an empty string value : " a s s e r t i o n s " : " "

work page
[54]

Do NOT omit any fields from your JSON object

work page
[55]

id " :

Format example showing req ui re d empty a s s e r t i o n s field : { " id " : " 42 " , ... , " a s s e r t i o n s " : " " }

work page
[56]

id " :

I N C O R R E C T : { " id " : " 42 " , ...} - missing a s s e r t i o n s field CR IT IC AL CHANGE - NEW SUFFIX R E Q U I R E M E N T S :

work page
[57]

The suffix must contain both e x e c u t i o n code AND a s s e r t i o n code

work page
[58]

Include assert () s t a t e m e n t s DI RE CTL Y IN THE SUFFIX at the a p p r o p r i a t e places

work page
[59]

All a s s e r t i o n s must be placed in the same fu nc tio n / class as the code being tested

work page
[60]

DO NOT create s ep ar ate a s s e r t i o n f u n c t i o n s or classes

work page
[61]

Place a s s e r t i o n s i m m e d i a t e l y after the code that should be tested

work page
[62]

Never d u p l i c a t e any g o l d e n _ c o m p l e t i o n code in the suffix

work page
[63]

The a s s e r t i o n s must pass when the com bi ne d prefix + g o l d e n _ c o m p l e t i o n + suffix is run Cr it ic al R e q u i r e m e n t s for Av oid in g D u p l i c a t i o n :

work page
[64]

The g o l d e n _ c o m p l e t i o n field should ONLY contain the s olu ti on code that fills in the gap

work page
[65]

The suffix must contain D I F F E R E N T code that follows after the c o m p l e t i o n

work page
[66]

Do NOT repeat any g o l d e n _ c o m p l e t i o n code in the suffix

work page
[67]

The suffix field should NEVER d u p l i c a t e the g o l d e n _ c o m p l e t i o n code

work page
[68]

There should be a clear D I S T I N C T I O N between what goes in g o l d e n _ c o m p l e t i o n vs suffix

work page
[69]

Ensure clear S E P A R A T I O N between c o m p l e t i o n and suffix content Include R e q u i r e m e n t s :

work page
[70]

Do NOT include headers unless they are AC TU AL LY USED in at least one of : - prefix - suffix ( i n c l u d i n g a s s e r t i o n s ) - g o l d e n _ c o m p l e t i o n

work page
[71]

Every i ncl ud ed header must serve a clear purpose

work page
[72]

just in case

Do not include " just in case " headers that aren't used

work page
[73]

All req ui re d in cl ud es must appear in the prefix section

work page
[74]

If an include is only needed for the g o l d e n _ c o m p l e t i o n , it must still appear in the prefix

work page
[75]

Make sure to include < cassert > header for assert () s t a t e m e n t s PREFIX LENGTH R E Q U I R E M E N T S - CR IT IC AL :

work page
[76]

The PREFIX section MUST be S U B S T A N T I A L L Y LONGER than other s ect io ns

work page
[77]

The prefix MUST be AT LEAST 50 -60 lines of code - this is an a bso lu te r e q u i r e m e n t

work page
[78]

Provide e x t e n s i v e context and setup code in the prefix

work page
[79]

Include helper functions , utility classes , and related code s t r u c t u r e s

work page
[80]

Add det ai le d co mm en ts and e x p l a n a t i o n s within the prefix

work page

Showing first 80 references.

[1] [1]

DevBench: A realistic, developer-informed benchmark for code generation models

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. DevBench: A realistic, developer-informed benchmark for code generation models. https://github.com/microsoft/devbench, 2026. Equal contribution: Pareesa Ameneh Golnari and Adarsh Kumarappan

work page 2026

[2] [2]

GitHub Copilot: Your AI pair programmer, 2025

GitHub. GitHub Copilot: Your AI pair programmer, 2025

work page 2025

[3] [3]

Cursor - The AI Code Editor, 2025

AnySphere. Cursor - The AI Code Editor, 2025

work page 2025

[4] [4]

Evaluating Large Language Models Trained on Code, July 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. Evaluating Large Language Models Trained on Code, July 2021

work page 2021

[5] [5]

Program Synthesis with Large Language Models, August 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, et al. Program Synthesis with Large Language Models, August 2021

work page 2021

[6] [6]

Measuring Coding Challenge Competence With APPS, November 2021

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, et al. Measuring Coding Challenge Competence With APPS, November 2021

work page 2021

[7] [7]

Mapping Language to Code in Programmatic Context, August 2018

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping Language to Code in Programmatic Context, August 2018

work page 2018

[8] [8]

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow, May 2018

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow, May 2018

work page 2018

[9] [9]

RepoMasterEval: Evaluating Code Completion via Real-World Repositories, August 2024

Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, et al. RepoMasterEval: Evaluating Code Completion via Real-World Repositories, August 2024

work page 2024

[10] [10]

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation, August 2023

Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, et al. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation, August 2023

work page 2023

[11] [11]

Cross- CodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion, November 2023

Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, et al. Cross- CodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion, November 2023. 10

work page 2023

[12] [12]

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, et al. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, February 2024

work page 2024

[13] [13]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, April 2025

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, et al. BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions, April 2025

work page 2025

[14] [14]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, et al

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, November 2024

work page 2024

[15] [15]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?, September 2025

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?, Septe...

work page 2025

[16] [16]

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories, March 2024

Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories, March 2024

work page 2024

[17] [17]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, et al. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024

work page 2024

[18] [18]

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review, June 2024

Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review, June 2024

work page 2024

[19] [19]

Beyond accuracy: Realistic and diagnostic evaluation of code generation models

Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Beyond accuracy: Realistic and diagnostic evaluation of code generation models. InNeurIPS 2025 Fourth Workshop on Deep Learning for Code, 2025

work page 2025

[20] [20]

Goucher, Adam Perelman, Aditya Ramesh, et al

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, et al. GPT-4o System Card, October 2024

work page 2024

[21] [21]

Davy Landman, Alexander Serebrenik, Eric Bouwers, and Jurgen J. Vinju. Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions.Journal of Software: Evolution and Process, 28(7):589–618, July 2016

work page 2016

[22] [22]

Efficacy of Synthetic Data as a Benchmark, September 2024

Gaurav Maheshwari, Dmitry Ivanov, and Kevin El Haddad. Efficacy of Synthetic Data as a Benchmark, September 2024

work page 2024

[23] [23]

Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, and Marah I. Abdin. On the Diversity of Synthetic Data and its Impact on Training Large Language Models, October 2024

work page 2024

[24] [24]

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13921–13937, Singapore, December 2023. Association for Computational Linguistics

work page 2023

[25] [25]

OpenAI o3-mini System Card

OpenAI, Brian Zhang, Eric Mitchell, and Hongyu Ren. OpenAI o3-mini System Card. https://openai.com/index/o3-mini-system-card/, 2025

work page 2025

[26] [26]

CodeJudge: Evaluating Code Generation with Large Language Models, October 2024

Weixi Tong and Tianyi Zhang. CodeJudge: Evaluating Code Generation with Large Language Models, October 2024

work page 2024

[27] [27]

GPT-4.1 System Card, April 2025

OpenAI, Ananya Kumar, Jiahui Yu, John Hallman, Michelle Pokrass, Adam Goucher, et al. GPT-4.1 System Card, April 2025

work page 2025

[28] [28]

GPT-5 System Card, August 2025

OpenAI, Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, et al. GPT-5 System Card, August 2025

work page 2025

[29] [29]

Claude 3.7 Sonnet

Anthropic. Claude 3.7 Sonnet. https://www.anthropic.com/claude/sonnet, 2025. 11

work page 2025

[30] [30]

https://www.anthropic.com/claude/sonnet

Claude Sonnet 4. https://www.anthropic.com/claude/sonnet

work page

[31] [31]

DeepSeek-V3 Technical Report, February 2025

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, et al. DeepSeek-V3 Technical Report, February 2025

work page 2025

[32] [32]

Un Ministral, des Ministraux | Mistral AI

Mistral. Un Ministral, des Ministraux | Mistral AI. https://mistral.ai/news/ministraux, 2025

work page 2025

[33] [33]

reverse engineer

OpenAI. Pricing. https://openai.com/api/pricing/, 2025. A Further category examples A.1 API Usage To illustrate this category, consider the following Python example that evaluates a model’s ability to correctly implement asynchronous HTTP requests using the Tornado library: Example 1: Python API Usage #1 import asyncio from tornado . h t t p c l i e n t i...

work page 2025

[34] [34]

Gen er at e diverse ex am pl es from these API c a t e g o r i e s ( rotate through them , don't focus only on file o p e r a t i o n s or network p r o t o c o l s ) : - Text and font p r o c e s s i n g ( HarfBuzz , FreeType , ICU ) - G ra phi cs and math l i b r a r i e s ( DirectXMath , Eigen , GLM , OpenGL ) - S ec uri ty / c r y p t o g r a p h y AP...

work page

[35] [35]

Ensure p at te rns are clear and i d e n t i f i a b l e even with u nc om mo n or d e p r e c a t e d APIs

work page

[36] [36]

Create ground truth c o m p l e t i o n s that r e p r e s e n t best p r a c t i c e s while h an dl ing API v e r s i o n i n g

work page

[37] [37]

Write a s s e r t i o n s that m e a n i n g f u l l y test both API c o r r e c t n e s s and p a r a m e t e r o rd eri ng

work page

[38] [38]

Provide clear j u s t i f i c a t i o n for why the example makes a good test case

work page

[39] [39]

Ensure code quality : - All code must be fully e x e c u t a b l e C ++ - All a s s e r t i o n s must pass when code is run - Include n e c e s s a r y in cl ud es and n a m e s p a c e s - Handle cleanup of r e s o u r c e s - Use proper e x c e p t i o n ha nd lin g - Include minimal working exa mp le s - Mock ext er na l d e p e n d e n c i e s where needed

work page

[40] [40]

Write robust a s s e r t i o n s that : - Verify actual API be ha vi or - Test p a r a m e t e r o rd eri ng - Check error c o n d i t i o n s - V al ida te return values - Mock ext er na l r e s o u r c e s When g e n e r a t i n g ex amp le s :

work page

[41] [41]

Focus on less common library f u n c t i o n s and domain - sp eci fi c APIs

work page

[42] [42]

Test the model's ha nd li ng of d e p r e c a t e d but valid API pa tt er ns

work page

[43] [43]

Ensure p at te rns include correct p a r a m e t e r o rd er in g and naming c o n v e n t i o n s

work page

[44] [44]

Include edge cases in API usage where re lev an t

work page

[45] [45]

" " A P I _ U S A G E _ U S E R _ P R O M P T =

Keep code focused on d e m o n s t r a t i n g rare but valid API i n t e r a c t i o n s " " " A P I _ U S A G E _ U S E R _ P R O M P T = " " " You are helping create a b e n c h m a r k for rare API usage c a p a b i l i t i e s . Your task is to g en era te a coding sc en ari o that tests an LLM's ability to r e c o g n i z e and co mp le te p at te r...

work page

[46] [46]

Your re sp on se MUST be a s y n t a c t i c a l l y valid JSON object

work page

[47] [47]

PRO PE RL Y ESCAPE all special c h a r a c t e r s in strings : - Use \\ " for double quotes inside strings - Use \\ n for n ew li ne s - Use \\ t for tabs 27 - Use \\\\ for b a c k s l a s h e s

work page

[48] [48]

The entire JSON object must be on a SINGLE LINE

work page

[49] [49]

Do NOT include f o r m a t t i n g or i n d e n t a t i o n outside the JSON s t r u c t u r e

work page

[50] [50]

DO NOT use mar kd ow n code blocks (` ` `) in your re sp on se

work page

[51] [51]

synthbench - api - usage

Test your JSON s t r u c t u r e before c o m p l e t i n g your r esp on se Re qu ir ed JSON fields : - id : A unique numeric i d e n t i f i e r - t e s t s o u r c e : Use " synthbench - api - usage " - l an gua ge : " cpp " - prefix : The code that comes before the c o m p l e t i o n ( may or may not e s t a b l i s h the API pattern ) - suffix : The...

work page

[52] [52]

ALWAYS include ALL req ui re d JSON fields listed above , even if empty

work page

[53] [53]

a s s e r t i o n s

The " a s s e r t i o n s " field MUST be present with an empty string value : " a s s e r t i o n s " : " "

work page

[54] [54]

Do NOT omit any fields from your JSON object

work page

[55] [55]

id " :

Format example showing req ui re d empty a s s e r t i o n s field : { " id " : " 42 " , ... , " a s s e r t i o n s " : " " }

work page

[56] [56]

id " :

I N C O R R E C T : { " id " : " 42 " , ...} - missing a s s e r t i o n s field CR IT IC AL CHANGE - NEW SUFFIX R E Q U I R E M E N T S :

work page

[57] [57]

The suffix must contain both e x e c u t i o n code AND a s s e r t i o n code

work page

[58] [58]

Include assert () s t a t e m e n t s DI RE CTL Y IN THE SUFFIX at the a p p r o p r i a t e places

work page

[59] [59]

All a s s e r t i o n s must be placed in the same fu nc tio n / class as the code being tested

work page

[60] [60]

DO NOT create s ep ar ate a s s e r t i o n f u n c t i o n s or classes

work page

[61] [61]

Place a s s e r t i o n s i m m e d i a t e l y after the code that should be tested

work page

[62] [62]

Never d u p l i c a t e any g o l d e n _ c o m p l e t i o n code in the suffix

work page

[63] [63]

The a s s e r t i o n s must pass when the com bi ne d prefix + g o l d e n _ c o m p l e t i o n + suffix is run Cr it ic al R e q u i r e m e n t s for Av oid in g D u p l i c a t i o n :

work page

[64] [64]

The g o l d e n _ c o m p l e t i o n field should ONLY contain the s olu ti on code that fills in the gap

work page

[65] [65]

The suffix must contain D I F F E R E N T code that follows after the c o m p l e t i o n

work page

[66] [66]

Do NOT repeat any g o l d e n _ c o m p l e t i o n code in the suffix

work page

[67] [67]

The suffix field should NEVER d u p l i c a t e the g o l d e n _ c o m p l e t i o n code

work page

[68] [68]

There should be a clear D I S T I N C T I O N between what goes in g o l d e n _ c o m p l e t i o n vs suffix

work page

[69] [69]

Ensure clear S E P A R A T I O N between c o m p l e t i o n and suffix content Include R e q u i r e m e n t s :

work page

[70] [70]

Do NOT include headers unless they are AC TU AL LY USED in at least one of : - prefix - suffix ( i n c l u d i n g a s s e r t i o n s ) - g o l d e n _ c o m p l e t i o n

work page

[71] [71]

Every i ncl ud ed header must serve a clear purpose

work page

[72] [72]

just in case

Do not include " just in case " headers that aren't used

work page

[73] [73]

All req ui re d in cl ud es must appear in the prefix section

work page

[74] [74]

If an include is only needed for the g o l d e n _ c o m p l e t i o n , it must still appear in the prefix

work page

[75] [75]

Make sure to include < cassert > header for assert () s t a t e m e n t s PREFIX LENGTH R E Q U I R E M E N T S - CR IT IC AL :

work page

[76] [76]

The PREFIX section MUST be S U B S T A N T I A L L Y LONGER than other s ect io ns

work page

[77] [77]

The prefix MUST be AT LEAST 50 -60 lines of code - this is an a bso lu te r e q u i r e m e n t

work page

[78] [78]

Provide e x t e n s i v e context and setup code in the prefix

work page

[79] [79]

Include helper functions , utility classes , and related code s t r u c t u r e s

work page

[80] [80]

Add det ai le d co mm en ts and e x p l a n a t i o n s within the prefix

work page