Using frontier models to synthesize plausible-but-wrong FIM completions as hard negatives for SFT raises Delulu exact match by 18.8 and edit similarity by 0.22 on Qwen2.5-Coder-7B, while also lifting HumanEval-Infilling and SAFIM.
URL https://dl.acm.org/doi/abs/10.1145/3520312.3534864
10 Pith papers cite this work.
representative citing papers
DEBENCH shows the best decompiler-LLM pair reaches only 22.3% program-level behavioral overlap and 1.2% exact stdout match, with decompiler engines driving 20x more variation than LLMs.
Mixed-methods study shows developers prefer GenAI for repetitive tasks, benefit from single interaction modes but not combined ones, and gain awareness from study participation.
ContentFuzz rewrites posts with LLM guidance from stance model confidence to flip machine labels without altering human intent, tested across four models and three datasets in two languages.
MLLMs exhibit a consistent recognition-reasoning inversion on discrete visual symbols across domains, underperforming on elementary perception while appearing competent on higher-level reasoning via linguistic compensation.
Survey of 868 scientific programmers shows generative AI adoption is highest among the inexperienced, who prefer conversational tools, and perceived productivity correlates most with volume of accepted generated code rather than validation practices.
User study finds that task difficulty affects keystroke dynamics during LLM prompting as a marker of cognitive effort, while device type has weaker effects and keystrokes do not predict perceived output usefulness.
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Engineering choices for tools, safety guardrails, and human oversight, more than underlying model quality, determine whether an internal coding agent delivers value in practice.
citing papers explorer
- StarCoder: may the source be with you!