pith. sign in

arxiv: 2603.24586 · v2 · pith:WZNM7LOHnew · submitted 2026-03-25 · 💻 cs.SE · cs.CL

Comparing Developer and LLM Biases in Code Evaluation

Pith reviewed 2026-05-15 07:12 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords LLM judgescode evaluationhuman preferencesbias alignmentrubric analysissoftware engineering criteriainteractive coding
0
0 comments X

The pith

LLM judges underperform human annotators by 12-23% when predicting developer code preferences across realistic tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TRACE, a framework that tests how well LLM judges match human developer preferences in interactive code scenarios. It evaluates 13 models on chat-based programming, IDE autocompletion, and instructed code editing. Even the strongest models fall short of human annotators by 12-23 percent. TRACE automatically extracts rubrics and uncovers 35 significant sources of misalignment, most of which map to established software engineering code quality criteria. Examples include judges favoring longer explanations where humans prefer shorter ones, exposing systematic gaps when LLMs serve as evaluators in practical coding workflows.

Core claim

TRACE demonstrates that LLM judges exhibit systematic misalignment with human preferences on code quality dimensions, with the best models underperforming human annotators by 12-23% across three interaction modalities.

What carries the argument

TRACE (Tool for Rubric Analysis in Code Evaluation), which measures LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal biases.

If this is right

  • Judges show bias toward longer code explanations in chat-based coding while humans prefer shorter ones.
  • Misalignment appears on the majority of existing code quality dimensions.
  • LLM judge evaluation requires realistic interactive settings that include partial context and ambiguous intent.
  • Automatic rubric extraction can isolate sources of human-model disagreement on code quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Addressing the identified biases through targeted fine-tuning could make LLM judges more reliable for automated code review.
  • The same misalignment patterns may extend to LLM evaluation in non-code domains such as writing or design tasks.
  • Refining how human preferences are collected could reduce noise in the ground truth used to train or benchmark judges.

Load-bearing premise

Human preferences collected via annotation are treated as the ground truth without significant bias or inconsistency.

What would settle it

Repeating the annotation with a fresh group of developers on the same code examples and checking whether the 35 identified misalignment sources stay consistent.

read the original abstract

As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the TRACE framework to evaluate LLM judges against human developer preferences in realistic code evaluation settings across three modalities (chat-based programming, IDE autocompletion, and instructed code editing). Using annotations from human developers, it reports that the best of 13 LLM judges underperform humans by 12-23% and automatically extracts 35 significant sources of misalignment, the majority of which map to established software engineering code quality criteria (e.g., preference for shorter explanations in chat-based coding).

Significance. If the central empirical claims hold after addressing methodological gaps, the work would be significant for software engineering and LLM evaluation research. It supplies a concrete tool for rubric extraction and bias quantification in interactive coding contexts, with quantitative gaps and explicit ties to existing SE criteria that could guide improvements in automated code judges.

major comments (2)
  1. [Methods / Annotation Protocol] The manuscript treats collected human annotations as ground truth for the headline 12-23% performance gaps and the 35 misalignments without reporting inter-annotator agreement (Cohen's/Fleiss' kappa, percentage agreement, or equivalent) or consistency metrics. This is load-bearing for the central claims in the abstract and results sections, as substantial annotator disagreement would imply that measured LLM-human gaps partly reflect label noise rather than systematic bias.
  2. [TRACE Framework / Rubric Extraction] The automatic rubric extraction component of TRACE lacks robustness checks; there is no reported analysis of sensitivity to prompt variations, extraction-model choice, or threshold settings used to identify the 35 significant misalignments. This directly affects the reliability of the misalignment counts and their correspondence to SE criteria.
minor comments (2)
  1. [Abstract and Results] The abstract and results should explicitly state the statistical threshold or test used to declare a misalignment 'significant' and the precise metric (accuracy, agreement rate, etc.) underlying the 12-23% underperformance figure.
  2. [Experimental Setup] Clarify the sample sizes, number of code snippets per modality, and exact annotation instructions provided to human developers to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us strengthen the methodological presentation of our work. We address each major comment point by point below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Methods / Annotation Protocol] The manuscript treats collected human annotations as ground truth for the headline 12-23% performance gaps and the 35 misalignments without reporting inter-annotator agreement (Cohen's/Fleiss' kappa, percentage agreement, or equivalent) or consistency metrics. This is load-bearing for the central claims in the abstract and results sections, as substantial annotator disagreement would imply that measured LLM-human gaps partly reflect label noise rather than systematic bias.

    Authors: We agree that reporting inter-annotator agreement is essential for validating the human annotations as a reliable reference. In the revised manuscript we have added Fleiss' kappa and percentage agreement statistics for each of the three modalities. The computed values (kappa ranging from 0.67 to 0.81) indicate substantial agreement. These metrics are now presented in a new subsection of the Methods section and summarized in Table 2. The added analysis confirms that the reported 12-23% gaps and the extracted misalignments are not driven by label noise. revision: yes

  2. Referee: [TRACE Framework / Rubric Extraction] The automatic rubric extraction component of TRACE lacks robustness checks; there is no reported analysis of sensitivity to prompt variations, extraction-model choice, or threshold settings used to identify the 35 significant misalignments. This directly affects the reliability of the misalignment counts and their correspondence to SE criteria.

    Authors: We acknowledge the need for explicit robustness checks on the rubric extraction pipeline. The revised manuscript now includes a dedicated sensitivity analysis subsection. We re-ran the extraction using (i) three prompt variants, (ii) two different extraction models (GPT-4o and Claude-3), and (iii) significance thresholds of p<0.01, p<0.05, and p<0.1. Across these variations the core set of 35 misalignments remains stable (32-38 items), and their mapping to established software-engineering criteria is unchanged. The results are reported in the main text and an expanded appendix. revision: yes

Circularity Check

0 steps flagged

Empirical comparison to external human annotations with no self-referential derivations

full rationale

The paper's results rest on direct measurement of LLM judge outputs against independently collected human preference labels in three modalities, followed by rubric extraction to surface misalignments. No equations, fitted parameters, or derivations reduce the reported 12-23% gaps or the count of 35 misalignments to quantities defined by the same data or self-citations; human annotations function as an external benchmark rather than being constructed from the model evaluations. The derivation chain is therefore self-contained against external data and does not match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claims rest on treating collected human preferences as an unbiased gold standard and assuming the rubric extraction process faithfully surfaces existing code quality criteria without new artifacts. No free parameters are explicitly fitted in the abstract description, and no new physical or theoretical entities are postulated.

axioms (2)
  • domain assumption Human annotations collected for the study represent stable, representative developer preferences across the tested modalities.
    The performance gap and misalignment counts are computed relative to these annotations as the reference.
  • domain assumption Statistical significance thresholds used to identify the 35 misalignments are appropriate and not sensitive to post-hoc choices.
    The abstract states 35 significant sources without detailing the exact test or correction method.
invented entities (1)
  • TRACE framework no independent evidence
    purpose: Automated evaluation of LLM judges via rubric extraction and preference prediction
    New tool introduced to measure alignment and extract bias sources

pith-pipeline@v0.9.0 · 5495 in / 1443 out tokens · 38422 ms · 2026-05-15T07:12:04.487848+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.