SWE-IF: Aligning Code Evaluation with Human Preference

Benoit Schillings; Dan Garrette; Jeremiah Liu; Jiao Sun; Jiawei Han; Ming Zhong; Nan Xu; Qingze Wang; Shyam Upadhyay; Ting-Yun Chang

arXiv: 2510.07315 · v2 · pith:CZPVNULInew · submitted 2025-10-08 · 💻 cs.CL · cs.AI · cs.LG · cs.SE


Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun


classification 💻 cs.CL cs.AI cs.LG cs.SE

keywords code functional correctness following instruction llms vibe check


Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
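As an illustration of what a "verifiable code instruction with a deterministic verifier" and a "composite score" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two example instructions (a line-length limit and mandatory docstrings), the function names, and the weighted average with weight `alpha` are hypothetical and are not taken from VeriCode or SWE-IF, whose actual instructions and scoring are defined in the paper and repository.

```python
import ast
from typing import Callable, Sequence


def verify_max_line_length(code: str, limit: int = 79) -> bool:
    """Hypothetical instruction: no line is longer than `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def verify_functions_have_docstrings(code: str) -> bool:
    """Hypothetical instruction: every function definition has a docstring."""
    tree = ast.parse(code)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def instruction_following_score(
    code: str, verifiers: Sequence[Callable[[str], bool]]
) -> float:
    """Fraction of instructions whose deterministic verifier passes."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return sum(bool(v(code)) for v in verifiers) / len(verifiers)


def composite_score(functional: float, instruction: float, alpha: float = 0.5) -> float:
    """Assumed aggregation: weighted average of the two signals (alpha is illustrative)."""
    return alpha * functional + (1 - alpha) * instruction


verifiers = [verify_max_line_length, verify_functions_have_docstrings]

# Functionally identical solutions; only the first follows both instructions.
compliant = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'
non_compliant = "def add(a, b):\n    return a + b\n"

if_compliant = instruction_following_score(compliant, verifiers)
if_non_compliant = instruction_following_score(non_compliant, verifiers)

# Suppose both solutions pass all unit tests (functional score 1.0).
score_compliant = composite_score(1.0, if_compliant)
score_non_compliant = composite_score(1.0, if_non_compliant)
```

In this sketch the two solutions are indistinguishable on functional correctness alone; only the instruction-following term separates them, which mirrors the abstract's claim that instruction following differentiates models beyond pass@k.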


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
cs.SE 2026-04 unverdicted novelty 7.0

Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.