
arxiv: 2510.07315 · v2 · pith:CZPVNULInew · submitted 2025-10-08 · 💻 cs.CL · cs.AI · cs.LG · cs.SE

SWE-IF: Aligning Code Evaluation with Human Preference

classification 💻 cs.CL cs.AI cs.LG cs.SE
keywords code functional correctness following instruction llms vibe check

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
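As a rough illustration of what a deterministic verifier and a composite score could look like, here is a minimal Python sketch. The two instructions checked here (a line-length limit and mandatory docstrings), all function names, and the weighted-average scoring formula are assumptions made for illustration only; the abstract does not specify the actual VeriCode instructions or how the composite score is computed.

```python
import ast
from typing import Callable, Sequence


def verify_max_line_length(code: str, limit: int = 79) -> bool:
    """Hypothetical verifier: every source line has at most `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def verify_functions_have_docstrings(code: str) -> bool:
    """Hypothetical verifier: every function definition carries a docstring."""
    tree = ast.parse(code)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def instruction_following_rate(
    code: str, verifiers: Sequence[Callable[[str], bool]]
) -> float:
    """Fraction of the given verifiers that the code satisfies."""
    results = [verifier(code) for verifier in verifiers]
    return sum(results) / len(results)


def composite_score(functional: float, instruction: float, weight: float = 0.5) -> float:
    """Assumed composite: a weighted average of the two signals (not the paper's formula)."""
    return weight * functional + (1.0 - weight) * instruction
```

Because each verifier is a pure function of the source text, the same code always receives the same verdict, which is the property that makes such checks usable as measurable signals alongside unit-test pass rates.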



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

    cs.SE 2026-04 unverdicted novelty 7.0

    Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.

  2. Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

    cs.SE 2026-04 unverdicted novelty 7.0

    The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.