Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal; Ameet Talwalkar; Chris Donahue; Ryan Shar; Shyam Agarwal; Tongshuang Wu; Valerie Chen; Wayne Chi; Zichu Wu

arxiv: 2603.24586 · v2 · pith:WZNM7LOHnew · submitted 2026-03-25 · 💻 cs.SE · cs.CL

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal , Ryan Shar , Zichu Wu , Shyam Agarwal , Tongshuang Wu , Chris Donahue , Ameet Talwalkar , Wayne Chi

show 1 more author

Valerie Chen

This is my paper

Pith reviewed 2026-05-15 07:12 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords LLM judgescode evaluationhuman preferencesbias alignmentrubric analysissoftware engineering criteriainteractive coding

0 comments

The pith

LLM judges underperform human annotators by 12-23% when predicting developer code preferences across realistic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRACE, a framework that tests how well LLM judges match human developer preferences in interactive code scenarios. It evaluates 13 models on chat-based programming, IDE autocompletion, and instructed code editing. Even the strongest models fall short of human annotators by 12-23 percent. TRACE automatically extracts rubrics and uncovers 35 significant sources of misalignment, most of which map to established software engineering code quality criteria. Examples include judges favoring longer explanations where humans prefer shorter ones, exposing systematic gaps when LLMs serve as evaluators in practical coding workflows.

Core claim

TRACE demonstrates that LLM judges exhibit systematic misalignment with human preferences on code quality dimensions, with the best models underperforming human annotators by 12-23% across three interaction modalities.

What carries the argument

TRACE (Tool for Rubric Analysis in Code Evaluation), which measures LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal biases.

If this is right

Judges show bias toward longer code explanations in chat-based coding while humans prefer shorter ones.
Misalignment appears on the majority of existing code quality dimensions.
LLM judge evaluation requires realistic interactive settings that include partial context and ambiguous intent.
Automatic rubric extraction can isolate sources of human-model disagreement on code quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Addressing the identified biases through targeted fine-tuning could make LLM judges more reliable for automated code review.
The same misalignment patterns may extend to LLM evaluation in non-code domains such as writing or design tasks.
Refining how human preferences are collected could reduce noise in the ground truth used to train or benchmark judges.

Load-bearing premise

Human preferences collected via annotation are treated as the ground truth without significant bias or inconsistency.

What would settle it

Repeating the annotation with a fresh group of developers on the same code examples and checking whether the 35 identified misalignment sources stay consistent.

read the original abstract

As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLM judges lag human coders by 12-23% on preference matching with 35 flagged misalignments, but the work rests on unverified human labels as ground truth.

read the letter

The main takeaway is that the best of 13 LLM judges still trail human annotators by 12-23% when predicting developer preferences on code, and the TRACE framework surfaces 35 specific misalignment points that mostly track existing software engineering criteria like explanation length and style. The paper applies this across chat-based coding, IDE autocompletion, and instructed editing, which moves the comparison closer to how these tools actually get used. The automatic rubric extraction is a useful addition because it turns the gaps into concrete, inspectable items rather than just aggregate scores. The numbers are reported plainly and the setup tests multiple models on realistic partial-context tasks, which is better than many prior LLM-as-judge studies that stay in abstract benchmarks. The central empirical claim holds up as a solid extension of earlier bias work in SE and NLP. The soft spot is the treatment of human annotations as the reference standard. No inter-annotator agreement figures or consistency checks appear in the reported methods, so some portion of the measured gaps and the extracted misalignments could trace to noise in the human labels rather than systematic model bias. The rubric extraction step also lacks shown robustness tests against prompt or model variation. This is a practical paper for people building or auditing LLM code tools who need targets for alignment work. It has enough new quantitative results and tooling to deserve a serious referee, though reviewers will likely ask for the missing annotation reliability details and extraction validation. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TRACE framework to evaluate LLM judges against human developer preferences in realistic code evaluation settings across three modalities (chat-based programming, IDE autocompletion, and instructed code editing). Using annotations from human developers, it reports that the best of 13 LLM judges underperform humans by 12-23% and automatically extracts 35 significant sources of misalignment, the majority of which map to established software engineering code quality criteria (e.g., preference for shorter explanations in chat-based coding).

Significance. If the central empirical claims hold after addressing methodological gaps, the work would be significant for software engineering and LLM evaluation research. It supplies a concrete tool for rubric extraction and bias quantification in interactive coding contexts, with quantitative gaps and explicit ties to existing SE criteria that could guide improvements in automated code judges.

major comments (2)

[Methods / Annotation Protocol] The manuscript treats collected human annotations as ground truth for the headline 12-23% performance gaps and the 35 misalignments without reporting inter-annotator agreement (Cohen's/Fleiss' kappa, percentage agreement, or equivalent) or consistency metrics. This is load-bearing for the central claims in the abstract and results sections, as substantial annotator disagreement would imply that measured LLM-human gaps partly reflect label noise rather than systematic bias.
[TRACE Framework / Rubric Extraction] The automatic rubric extraction component of TRACE lacks robustness checks; there is no reported analysis of sensitivity to prompt variations, extraction-model choice, or threshold settings used to identify the 35 significant misalignments. This directly affects the reliability of the misalignment counts and their correspondence to SE criteria.

minor comments (2)

[Abstract and Results] The abstract and results should explicitly state the statistical threshold or test used to declare a misalignment 'significant' and the precise metric (accuracy, agreement rate, etc.) underlying the 12-23% underperformance figure.
[Experimental Setup] Clarify the sample sizes, number of code snippets per modality, and exact annotation instructions provided to human developers to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us strengthen the methodological presentation of our work. We address each major comment point by point below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Methods / Annotation Protocol] The manuscript treats collected human annotations as ground truth for the headline 12-23% performance gaps and the 35 misalignments without reporting inter-annotator agreement (Cohen's/Fleiss' kappa, percentage agreement, or equivalent) or consistency metrics. This is load-bearing for the central claims in the abstract and results sections, as substantial annotator disagreement would imply that measured LLM-human gaps partly reflect label noise rather than systematic bias.

Authors: We agree that reporting inter-annotator agreement is essential for validating the human annotations as a reliable reference. In the revised manuscript we have added Fleiss' kappa and percentage agreement statistics for each of the three modalities. The computed values (kappa ranging from 0.67 to 0.81) indicate substantial agreement. These metrics are now presented in a new subsection of the Methods section and summarized in Table 2. The added analysis confirms that the reported 12-23% gaps and the extracted misalignments are not driven by label noise. revision: yes
Referee: [TRACE Framework / Rubric Extraction] The automatic rubric extraction component of TRACE lacks robustness checks; there is no reported analysis of sensitivity to prompt variations, extraction-model choice, or threshold settings used to identify the 35 significant misalignments. This directly affects the reliability of the misalignment counts and their correspondence to SE criteria.

Authors: We acknowledge the need for explicit robustness checks on the rubric extraction pipeline. The revised manuscript now includes a dedicated sensitivity analysis subsection. We re-ran the extraction using (i) three prompt variants, (ii) two different extraction models (GPT-4o and Claude-3), and (iii) significance thresholds of p<0.01, p<0.05, and p<0.1. Across these variations the core set of 35 misalignments remains stable (32-38 items), and their mapping to established software-engineering criteria is unchanged. The results are reported in the main text and an expanded appendix. revision: yes

Circularity Check

0 steps flagged

Empirical comparison to external human annotations with no self-referential derivations

full rationale

The paper's results rest on direct measurement of LLM judge outputs against independently collected human preference labels in three modalities, followed by rubric extraction to surface misalignments. No equations, fitted parameters, or derivations reduce the reported 12-23% gaps or the count of 35 misalignments to quantities defined by the same data or self-citations; human annotations function as an external benchmark rather than being constructed from the model evaluations. The derivation chain is therefore self-contained against external data and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on treating collected human preferences as an unbiased gold standard and assuming the rubric extraction process faithfully surfaces existing code quality criteria without new artifacts. No free parameters are explicitly fitted in the abstract description, and no new physical or theoretical entities are postulated.

axioms (2)

domain assumption Human annotations collected for the study represent stable, representative developer preferences across the tested modalities.
The performance gap and misalignment counts are computed relative to these annotations as the reference.
domain assumption Statistical significance thresholds used to identify the 35 misalignments are appropriate and not sensitive to post-hoc choices.
The abstract states 35 significant sources without detailing the exact test or correction method.

invented entities (1)

TRACE framework no independent evidence
purpose: Automated evaluation of LLM judges via rubric extraction and preference prediction
New tool introduced to measure alignment and extract bias sources

pith-pipeline@v0.9.0 · 5495 in / 1443 out tokens · 38422 ms · 2026-05-15T07:12:04.487848+00:00 · methodology

Comparing Developer and LLM Biases in Code Evaluation

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)