The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
Can it edit? evaluating the ability of large language models to follow code editing instructions
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
citing papers explorer
-
Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
-
LLMs Corrupt Your Documents When You Delegate
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
-
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.