CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging

Aleksandar Cvejic; Peter Wonka; Zehui Chen; Zhenyu Li

arxiv: 2605.19665 · v1 · pith:4LTYRC5Jnew · submitted 2026-05-19 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-20 04:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: code preference judging · pairwise evaluation · rubric-based judging · LLM judges · human preference alignment · code generation evaluation · criterion alignment

The pith

CriterAlign improves code preference judging accuracy from 60.4 percent to 66.3 percent by shifting to direct pairwise comparisons on each criterion plus guidance extracted from human preference gaps.
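As a quick check on the headline figures, the gain can be restated in absolute and relative terms. Only the two accuracies come from the source; the relative gain and error-reduction figures below are derived here from those two numbers.

```latex
\begin{align*}
\Delta_{\text{abs}} &= 66.3 - 60.4 = 5.9 \text{ percentage points} \\
\Delta_{\text{rel}} &= \frac{5.9}{60.4} \approx 9.8\% \\
\text{error reduction} &= \frac{(100 - 60.4) - (100 - 66.3)}{100 - 60.4} = \frac{5.9}{39.6} \approx 14.9\%
\end{align*}
```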

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ways of using rubrics for code evaluation do not match the needs of pairwise preference prediction. Monolithic judges or pointwise scoring followed by comparison both leave room for improvement when responses differ on task-specific trade-offs. CriterAlign instead has the model judge the two responses against each other on every criterion, refines ties, checks consistency by swapping order, and combines the results. It also pulls out recurring differences between human preferences and initial judge outputs to create guidance that gets inserted into the criterion generation and judging steps. This produces higher accuracy on the BigCodeReward benchmark.

Core claim

CriterAlign adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. It further introduces Human-Preference-Aligned Guidance synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, then injects this guidance into the criterion generator, criterion judge, and final judge. On BigCodeReward this raises accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4 percent to 66.3 percent.
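The swap-consistency step named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the judge signature, the names `swap_consistent_judgments`, `filtered_fraction`, and `toy_judge`, and the choice to drop (rather than downweight) inconsistent criteria are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (criterion, first_shown, second_shown) -> label,
# where "A" means the first-shown response wins and "B" the second-shown one.
Judge = Callable[[str, str, str], str]

FLIP = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistent_judgments(
    criteria: List[str], resp_a: str, resp_b: str, judge: Judge
) -> Dict[str, str]:
    """Keep only criteria whose label survives swapping the presentation order."""
    kept: Dict[str, str] = {}
    for criterion in criteria:
        forward = judge(criterion, resp_a, resp_b)
        # Second call shows the responses in reversed order; map the label back
        # into the original A/B frame before comparing.
        backward = FLIP[judge(criterion, resp_b, resp_a)]
        if forward == backward:
            kept[criterion] = forward
    return kept


def filtered_fraction(criteria: List[str], kept: Dict[str, str]) -> float:
    """Fraction of criteria removed by the consistency check."""
    return 1 - len(kept) / len(criteria) if criteria else 0.0


def toy_judge(criterion: str, first: str, second: str) -> str:
    """Stand-in for an LLM call (assumption): position-biased on one criterion."""
    if criterion == "readability":
        return "A"  # always prefers whatever is shown first
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


criteria = ["correctness", "readability"]
kept = swap_consistent_judgments(criteria, "x = 1", "x = 1  # set x", toy_judge)
print(kept, filtered_fraction(criteria, kept))
```

In this toy run the position-biased criterion is dropped and the order-invariant one is kept, which is the behavior the filter is meant to enforce.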

What carries the argument

CriterAlign, a criterion-centric framework that replaces pointwise scoring with direct pairwise judgments at the criterion level and augments the process with offline-synthesized Human-Preference-Aligned Guidance drawn from human-judge rationale gaps.

Load-bearing premise

The pairwise criterion judgments and offline-extracted guidance will generalize to new code tasks and human preference distributions without overfitting to the particular examples used for guidance synthesis.

What would settle it

Running the complete CriterAlign pipeline on a fresh code preference dataset drawn from different programming tasks and annotator pools and checking whether accuracy still exceeds the monolithic baseline by a similar margin.

Figures

Figures reproduced from arXiv: 2605.19665 by Aleksandar Cvejic, Peter Wonka, Zehui Chen, Zhenyu Li.

**Figure 1.** Learning human preference guidance for code judging. Given the same coding task and two candidate responses, human judges and LLM judges may produce different preference decisions. CriterAlign analyzes the human and LLM decisions on training cases, summarizes rationale gaps into an alignment guidance, and injects the guidance into the judge at inference time. This enables the LLM judge to better match huma…

**Figure 2.** Inference Pipeline Comparison. Monolithic judges use fixed or implicit criteria, while rubric-based methods such as RRD [Shen et al., 2026] generate criteria but rely on pointwise criterion refinement and scoring. CriterAlign synthesizes human-preference-aligned guidance (HPAG) offline from the training split and injects it into the pairwise rubric-based pipeline for human-aligned inference. Orange highlig…

**Figure 3.** [Caption not recovered. The extracted text is a truncated C++ fragment: a loop over generations that computes each next row with `rule30(left, center, right)` on a wrap-around grid and prints live cells as a green "X".]

Original abstract

Pairwise human preference prediction is central to evaluating code-generation systems, where quality often depends on task-specific trade-offs beyond functional correctness. While rubric-based LLM judges improve interpretability by decomposing evaluation into explicit criteria, most existing pipelines remain pointwise: they score each response independently and derive preferences by comparing aggregated scores. We show that this design is poorly matched to pairwise code preference prediction and can underperform a strong monolithic judge. We propose CriterAlign, a criterion-centric framework that adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. We further introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, and injected into the criterion generator, criterion judge, and final judge. On BigCodeReward, CriterAlign improves a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3% accuracy, with ablations confirming the contributions of pairwise criterion design and HPAG.
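To make the pointwise-versus-pairwise distinction in the abstract concrete, here is a toy sketch. It is an illustration written for this page, not an analysis from the paper: the scores, criterion names, and the function `pointwise_preference` are made up, and the example shows only one way aggregated scores can hide a per-criterion difference.

```python
from typing import Dict


def pointwise_preference(scores_a: Dict[str, int], scores_b: Dict[str, int]) -> str:
    """Pointwise baseline: score each response alone, then compare the totals."""
    total_a, total_b = sum(scores_a.values()), sum(scores_b.values())
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


# Made-up rubric scores on a 1-5 scale (not data from the paper).
scores_a = {"correctness": 5, "readability": 3}
scores_b = {"correctness": 3, "readability": 5}

# Labels a direct criterion-level pairwise judge might return for the same pair.
pairwise_labels = {"correctness": "A", "readability": "B"}

print(pointwise_preference(scores_a, scores_b))  # prints "tie"
print(pairwise_labels)
```

The aggregated totals are equal, so the pointwise route reports a tie, while the criterion-level labels still record which response won on which criterion and leave the weighting to a final judge.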

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CriterAlign gets a 5.9-point accuracy lift on BigCodeReward (60.4% to 66.3%) by switching to direct pairwise criterion judgments and adding HPAG, guidance extracted offline from the benchmark's own training split. The core move is recognizing that pointwise rubric scoring does not match how humans actually compare code responses, then fixing it with tie-driven refinement, swap-consistency filtering, and final synthesis. They also pull recurring rationale gaps between human labels and a monolithic judge to create HPAG, which gets injected into the criterion generator and both judges. That pipeline is the concrete new piece relative to earlier rubric work.

The ablations they report line up with the headline gain, which is helpful for anyone tuning LLM evaluators for code. The improvement over the Qwen2.5-VL-32B baseline is stated clearly and the design choices address a real mismatch in current practice.

The main soft spot is that HPAG is built offline from the training examples on BigCodeReward itself. If those recurring gaps partly reflect benchmark-specific labeler habits or dataset artifacts rather than general code preference patterns, the lift could shrink on new data. The abstract mentions ablations for the pairwise design and HPAG but does not describe held-out or cross-benchmark checks, so robustness remains an open question.

This is aimed at people building or evaluating LLM-based code judges and reward models. Readers who need a practical way to improve pairwise accuracy on code tasks will get usable ideas from the pipeline and numbers. I would send it for peer review. The empirical result is concrete enough to justify referee time, even if the generalization story needs more evidence in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CriterAlign, a criterion-centric framework for pairwise code preference judging in code generation evaluation. It replaces pointwise rubric scoring with direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. The authors introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge outputs, then injected into the criterion generator, judge, and synthesis stage. On BigCodeReward, the method lifts accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3%, with ablations attributing gains to the pairwise design and HPAG.

Significance. If the reported gains hold under proper generalization checks, the work could meaningfully advance interpretable LLM judges for code preferences by aligning evaluation more closely with human rationales at the criterion level. The explicit ablations and the offline HPAG construction are strengths that allow component-wise assessment. The result is potentially impactful for preference modeling in code generation, provided the guidance does not overfit to the training distribution.

major comments (2)

[§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.
[§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.
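One standard way to address the significance concern in the second major comment is a paired test on per-example correctness. The sketch below uses an exact McNemar test, which is a technique chosen here for illustration and is not something the paper reports; the function names and the toy correctness lists are hypothetical.

```python
from math import comb
from typing import List, Tuple


def discordant_counts(base_correct: List[bool], new_correct: List[bool]) -> Tuple[int, int]:
    """Count examples where exactly one of the two judges is correct."""
    b = sum(1 for x, y in zip(base_correct, new_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, new_correct) if not x and y)
    return b, c


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-example correctness flags (placeholders, not results from the paper).
baseline = [True, False, False, True]
criteralign = [True, True, True, False]
b, c = discordant_counts(baseline, criteralign)
print(b, c, mcnemar_exact_p(b, c))
```

Because both judges are scored on the same test pairs, only the examples where they disagree carry information about which one is better, which is why the test looks at `b` and `c` alone.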

minor comments (2)

[§3.2] The description of swap-consistency filtering would benefit from a quantitative breakdown (e.g., fraction of pairs filtered and its effect on final accuracy) to clarify its contribution beyond the qualitative motivation.
[§3.4] Notation for the final pairwise synthesis step could be made more explicit; a small equation or pseudocode block would help readers trace how criterion-level judgments are aggregated into the overall preference decision.
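In the spirit of the pseudocode requested in the last minor comment, here is a minimal sketch of how criterion-level labels could be packaged as evidence for the final judge. The final stage in the paper is an LLM call, so nothing below is the paper's decision rule: `format_evidence` and `tally` are hypothetical helper names, and the tally is a diagnostic only.

```python
from collections import Counter
from typing import Dict


def tally(judgments: Dict[str, str]) -> Counter:
    """Count how many criteria favored A, B, or neither (diagnostic only)."""
    return Counter(judgments.values())


def format_evidence(judgments: Dict[str, str]) -> str:
    """Render criterion-level labels as a bullet list for a final-judge prompt."""
    return "\n".join(f"- {criterion}: {label}" for criterion, label in judgments.items())


# Hypothetical criterion labels that survived filtering.
judgments = {
    "handles empty input": "A",
    "respects the stated output format": "tie",
    "avoids unnecessary dependencies": "A",
}
print(tally(judgments))
print(format_evidence(judgments))
```

A final judge that receives this list as evidence can still weigh one decisive criterion above several minor ones, which a plain majority over the tally would not do.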

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, which identifies key areas for strengthening claims about generalization and statistical robustness. We respond to each major comment below and indicate the revisions we will make to the manuscript.

Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.

Authors: We appreciate the referee highlighting the risk that HPAG may capture dataset-specific artifacts. The construction process in §3.3 extracts recurring rationale gaps between human preferences and monolithic judge outputs across the training examples, with the intent of identifying systematic patterns rather than instance-specific noise. Existing ablations in §4.2 already isolate HPAG by removing it while retaining the same training data and other components, showing a performance drop that supports its contribution. Nevertheless, we acknowledge the absence of cross-dataset testing. In the revised manuscript we will add evaluations on HumanEval and CodeContests using the same CriterAlign pipeline (with HPAG extracted only from BigCodeReward training data) to provide direct evidence of generalization beyond the original benchmark.
revision: yes
Referee: [§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.

Authors: We agree that reporting variance and statistical measures would improve confidence in the results. The headline numbers and ablations were obtained from single runs, primarily due to the substantial computational cost of repeated inference with 32B-scale models. The ablation table does demonstrate consistent directional gains when components are added or removed. In the revised version we will rerun the key configurations (monolithic baseline, full CriterAlign, and the two main ablations) across three random seeds on the BigCodeReward test split and report means with standard deviations or confidence intervals.
revision: yes
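The per-seed reporting promised in this response amounts to a mean and a sample standard deviation per configuration. A minimal sketch follows, with placeholder accuracies that are not results from the paper and a hypothetical helper name `summarize`.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(accuracies: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)


# Placeholder per-seed accuracies for one configuration (not from the paper).
runs = [0.5, 0.6, 0.7]
m, s = summarize(runs)
print(f"{m:.3f} +/- {s:.3f}")
```

With only three seeds the standard deviation is itself a rough estimate, so it complements rather than replaces a paired significance test on the test split.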

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark gains remain independent of inputs.

The paper describes an empirical framework (CriterAlign) that applies direct criterion-level pairwise judgments, tie-driven refinement, swap-consistency filtering, final synthesis, and offline-synthesized HPAG guidance extracted from training examples. The central result is a reported accuracy lift on the external BigCodeReward benchmark (60.4% to 66.3%). No equations, derivations, or self-referential definitions are present that would reduce any claimed prediction or result to its own fitted inputs by construction. HPAG synthesis is a one-time offline extraction step whose output is then used as guidance; this does not make the downstream accuracy score tautological or equivalent to the extraction process itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described chain. The improvement is therefore presented as a self-contained empirical outcome against an external benchmark rather than a restatement of the method's own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that LLMs can reliably perform criterion-wise pairwise comparisons and that recurring rationale gaps extracted from a training set provide useful guidance without introducing bias; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: LLM judges can be improved by injecting human-preference rationale gaps extracted offline from training examples.
Central to the HPAG component described in the abstract.

invented entities (1)

HPAG (Human-Preference-Aligned Guidance) · no independent evidence
purpose: Synthesized guidance injected into criterion generator, criterion judge, and final judge to align with human preferences.
Extracted from training examples where human preferences differ from monolithic judge predictions.

pith-pipeline@v0.9.0 · 5729 in / 1400 out tokens · 36333 ms · 2026-05-20T04:29:38.145821+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis plus HPAG

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: Human-Preference-Aligned Guidance (HPAG) synthesized offline from training examples by extracting recurring rationale gaps

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

[1] The user’s programming instruction

[2] Both solutions (code + execution output + screenshots if available)

[3] The human’s per-aspect votes (correctness, efficiency, explainability, maintainability, UI/UX design)

[4] reasoning
The human’s overall vote Your task: Synthesize a coherent rationale that explains the human’s overall preference, grounded in their aspect-level judgments. Where aspects disagree with the overall vote, explain the likely reasoning for why certain aspects were weighted more heavily. Think like a practical developer, not a formal reviewer. **Human Aspect Vo...

[5] **Criterion generation**: An LLM generates 16-20 atomic evaluation criteria for the given task

[6] **Per-criterion judging**: For each criterion, the LLM judges which solution is better (A/B/tie)

[7] reverse-engineered human rationales
**Final judging**: The LLM makes an overall preference decision using the criterion judgments as evidence. At runtime, a sample’s task category is known. Each of the three stage-LLMs will be shown the **global guidance for its stage concatenated with the category-specific guidance for its stage**. So for every category we need the same four kinds of guidan...

[8] **Global guidance** (applies to every sample, cross-cutting patterns): same structure as before -- divergence patterns plus stage-specific guidance for criterion generation, per-criterion judging, and final judging.

[9] key_divergence_patterns
**Per-category guidance** (applies only when the sample is of that category, category-distinctive patterns): for **each** of the six categories, produce the *same four kinds* of content as the global block -- divergence patterns, criterion-generation guidance, criterion-judging guidance, final-judging guidance -- at **comparable length** to the global ver...

[10] **Atomic** -- one checkable claim, with no "and/or" combinations

[11] **Specific** -- concrete observable properties, not abstractions like "good code"

[12] **Judgeable** -- a downstream model can decide it from the code, answer text, and execution evidence

[13] **Task-relevant** -- tied to the instruction’s requirements, constraints, expected outputs, or user-valued qualities for this task

[14] **Non-redundant** -- no near-duplicates or trivial rephrasings

[15] The solution
**Comparative-useful** -- the criterion should capture a property on which the two responses could plausibly differ in a way that affects user preference **Important goal: prefer preference-driving criteria over adequacy-only checks.** - Prefer criteria that would actually help distinguish which solution is better. - Avoid criteria that both solutions are...

[16] Whether the solution correctly fulfills the user’s actual request

[17] Whether it handles important edge cases or failure modes relevant to this task

[18] Whether it respects explicit constraints in the instruction

[19] Whether one response is meaningfully more useful, complete, robust, or user-aligned

[20] criteria
Only then consider secondary qualities like readability, comments, or maintainability, and only if they are likely to affect user preference here --- <|Instruction|> {INSTRUCTION} <|The Start of Assistant A’s Answer|> {ANSWER_A}{SCREENSHOT_A_SECTION}{VISUAL_A_SECTION}<|The End of Assistant A’s Answer|> <|The Start of Assistant B’s Answer|> {ANSWER_B}{SCRE...

[21] Identify the key requirements, explicit constraints, likely failure modes, and meaningful quality differences from the instruction and both solutions

[22] Draft criteria that are grounded in those concrete requirements and likely to influence overall preference

[23] Remove any criterion that is vague, redundant, weakly judgeable, or unlikely to distinguish the two responses

[24] Remove excess adequacy-only checks if they dominate the list

[25] Verify each surviving criterion is atomic, judgeable, response-neutral, and useful for pairwise comparison

[26] `"A"` -- Solution A clearly better satisfies this criterion
Output JSON only E.4 Pairwise criterion judging Listing 6: Pairwise criterion judging prompt template. You are a code evaluation judge. Your task is to compare two candidate solutions (A and B) against a specific list of evaluation criteria. For each criterion, determine which solution better satisfies it based on the code implementations and execution re...

[27] Judge each criterion **symmetrically**. Do not treat Solution A as the default or reference answer

[28] If the positions of A and B were swapped, the judgment should swap accordingly

[29] Base the judgment on **comparative evidence**, not merely on whether each solution individually clears a minimum bar

[30] Use `"tie"` only when the available evidence indicates the two solutions are genuinely comparable on this specific criterion

[31] Do NOT default to `"A"` when the difference is subtle. If the evidence slightly but meaningfully favors B, choose `"B"`

[32] If both solutions are flawed in different ways, still choose the one that better satisfies the criterion unless they are genuinely comparable

[33] criterion_results
Do not let your judgment on one criterion leak into another; evaluate each criterion independently. **Anti-bias requirement** - Be strictly position-invariant. - Do not assume A is better because it appears first. - Do not use style, verbosity, or answer order as a tiebreaker unless the criterion explicitly concerns those aspects. --- <|Instruction|> {INS...

[34] Correctly implementing the requested functionality

[35] Respecting explicit constraints and requirements

[36] `"A"`: Solution A is clearly better overall
Avoiding important errors, omissions, or misleading behavior You may also consider efficiency, explainability, maintainability, and UI/UX when relevant, but these should not outweigh major correctness or requirement-fulfillment differences. **Human-alignment guidance (derived from analysis of human-vs-LLM preference disagreements):** {GUIDANCE} **Category...

[37] Evaluate the two responses **symmetrically**. Do not treat Solution A as the default or reference

[38] Focus on the **most decisive differences**, not on counting superficial advantages

[39] Do not mechanically follow the majority of per-criterion labels. Many local ties or minor advantages do not necessarily imply an overall tie

[40] A small number of high-impact differences may outweigh many minor equalities

[41] Use `"Tie"` only when the solutions are genuinely comparable in the aspects that matter most to the user’s request

[42] If the positions of A and B were swapped, your overall judgment should swap accordingly

[43] reasoning
Do not favor A because it appears first, is longer, sounds more confident, or is more stylistically polished unless those qualities materially improve fulfillment of the user’s request. **Input Format** <|Instruction|> {INSTRUCTION} <|The Start of Assistant A’s Answer|> {ANSWER_A}{SCREENSHOT_A_SECTION}{VISUAL_A_SECTION}<|The End of Assistant A’s Answer...

[44] We only use existing benchmark annotations
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
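The extracted prompt fragments describe three stages (criterion generation, per-criterion judging, final judging) and say that each stage is shown the global guidance for its stage concatenated with the category-specific guidance for its stage. A minimal sketch of that assembly step follows. The stage keys, the function name `stage_guidance`, the blank-line separator, the fallback to global-only guidance for an unknown category, and the placeholder strings are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical stage keys mirroring the three stages named in the fragments.
STAGES = ("criterion_generation", "criterion_judging", "final_judging")


def stage_guidance(
    global_guidance: Dict[str, str],
    category_guidance: Dict[str, Dict[str, str]],
    category: str,
    stage: str,
) -> str:
    """Concatenate global guidance with category-specific guidance for one stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    parts = [global_guidance[stage]]
    # Assumption: fall back to global-only guidance when the category is unknown.
    specific = category_guidance.get(category, {}).get(stage)
    if specific:
        parts.append(specific)
    return "\n\n".join(parts)


# Placeholder guidance text; real HPAG content is synthesized offline.
global_g = {
    "criterion_generation": "G-gen",
    "criterion_judging": "G-judge",
    "final_judging": "G-final",
}
category_g = {"category_x": {"final_judging": "X-final"}}

print(stage_guidance(global_g, category_g, "category_x", "final_judging"))
```

The result would then fill a slot such as the `{GUIDANCE}` placeholder that appears in the final-judge prompt fragment.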
