arxiv: 2605.19665 · v1 · pith:4LTYRC5Jnew · submitted 2026-05-19 · 💻 cs.SE · cs.AI

CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging

Pith reviewed 2026-05-20 04:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code preference judging · pairwise evaluation · rubric-based judging · LLM judges · human preference alignment · code generation evaluation · criterion alignment

The pith

CriterAlign improves code preference judging accuracy from 60.4 percent to 66.3 percent by shifting to direct pairwise comparisons on each criterion plus guidance extracted from human preference gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ways of using rubrics for code evaluation do not match the needs of pairwise preference prediction. Monolithic judges or pointwise scoring followed by comparison both leave room for improvement when responses differ on task-specific trade-offs. CriterAlign instead has the model judge the two responses against each other on every criterion, refines ties, checks consistency by swapping order, and combines the results. It also pulls out recurring differences between human preferences and initial judge outputs to create guidance that gets inserted into the criterion generation and judging steps. This produces higher accuracy on the BigCodeReward benchmark.

Core claim

CriterAlign adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. It further introduces Human-Preference-Aligned Guidance synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, then injects this guidance into the criterion generator, criterion judge, and final judge. On BigCodeReward this raises accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4 percent to 66.3 percent.
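The swap-consistency step named above can be illustrated with a minimal sketch. It assumes a hypothetical `judge(criterion, first, second)` callable that returns "A" when the first-shown response wins, "B" when the second-shown one wins, or "tie". The `toy_judge` below is a stand-in for an LLM call and is not the paper's prompt or implementation; it only makes the example runnable.

```python
from typing import Callable, Dict, List

Verdict = str  # one of "A", "B", "tie"
# Hypothetical judge signature (an assumption, not the paper's API):
# (criterion, first_response, second_response) -> verdict, where "A" means
# the first-shown response wins and "B" the second-shown one.
Judge = Callable[[str, str, str], Verdict]

FLIP = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistent_verdicts(
    criteria: List[str], resp_a: str, resp_b: str, judge: Judge
) -> Dict[str, Verdict]:
    """Keep only criteria whose verdict survives swapping the presentation order."""
    kept: Dict[str, Verdict] = {}
    for criterion in criteria:
        forward = judge(criterion, resp_a, resp_b)
        # Judge again with the order swapped, then map the label back
        # to the original A/B naming before comparing.
        backward = FLIP[judge(criterion, resp_b, resp_a)]
        if forward == backward:
            kept[criterion] = forward
    return kept


def toy_judge(criterion: str, first: str, second: str) -> Verdict:
    """Stand-in for an LLM call: prefers the longer response, except on one
    criterion where it is deliberately biased toward the first slot."""
    if criterion == "readability":
        return "A"  # always picks whichever response is shown first
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


verdicts = swap_consistent_verdicts(
    ["handles edge cases", "readability"], "short", "a longer answer", toy_judge
)
print(verdicts)
```

In this toy run the position-biased "readability" verdict flips with the order and is dropped, while the order-independent verdict is kept. How the paper treats dropped criteria downstream is not specified on this page.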

What carries the argument

CriterAlign, a criterion-centric framework that replaces pointwise scoring with direct pairwise judgments at the criterion level and augments the process with offline-synthesized Human-Preference-Aligned Guidance drawn from human-judge rationale gaps.

Load-bearing premise

The pairwise criterion judgments and offline-extracted guidance will generalize to new code tasks and human preference distributions without overfitting to the particular examples used for guidance synthesis.

What would settle it

Running the complete CriterAlign pipeline on a fresh code preference dataset drawn from different programming tasks and annotator pools and checking whether accuracy still exceeds the monolithic baseline by a similar margin.
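The check described above comes down to comparing two accuracies against the same human labels. A minimal sketch, assuming predictions and labels are plain strings such as "A", "B", or "tie"; the function names are illustrative and do not come from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of pairs where the judge's preference matches the human one."""
    if len(predictions) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == h for p, h in zip(predictions, human_labels))
    return hits / len(human_labels)


def margin(
    candidate_preds: Sequence[str],
    baseline_preds: Sequence[str],
    human_labels: Sequence[str],
) -> float:
    """Accuracy gain of the candidate judge over the baseline, in percentage points."""
    return 100.0 * (
        accuracy(candidate_preds, human_labels) - accuracy(baseline_preds, human_labels)
    )


# Margin reported on BigCodeReward (66.3% vs. 60.4%), for reference when
# reading a result on a new dataset.
reported_margin = round(66.3 - 60.4, 1)
```

A margin on the new dataset that is close to `reported_margin` would support the generalization claim; a margin near zero would point toward benchmark-specific guidance.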

Figures

Figures reproduced from arXiv: 2605.19665 by Aleksandar Cvejic, Peter Wonka, Zehui Chen, Zhenyu Li.

Figure 1: Learning human preference guidance for code judging. Given the same coding task and two candidate responses, human judges and LLM judges may produce different preference decisions. CRITERALIGN analyzes the human and LLM decisions on training cases, summarizes rationale gaps into an alignment guidance, and injects the guidance into the judge at inference time. This enables the LLM judge to better match huma…
Figure 2: Inference Pipeline Comparison. Monolithic judges use fixed or implicit criteria, while rubric-based methods such as RRD [Shen et al., 2026] generate criteria but rely on pointwise criterion refinement and scoring. CRITERALIGN synthesizes human-preference-aligned guidance (HPAG) offline from the training split and injects it into the pairwise rubric-based pipeline for human-aligned inference. Orange highlig…
Figure 3: (caption not recovered; the extracted text is a fragment of a C++ code listing shown in the figure)
Original abstract

Pairwise human preference prediction is central to evaluating code-generation systems, where quality often depends on task-specific trade-offs beyond functional correctness. While rubric-based LLM judges improve interpretability by decomposing evaluation into explicit criteria, most existing pipelines remain pointwise: they score each response independently and derive preferences by comparing aggregated scores. We show that this design is poorly matched to pairwise code preference prediction and can underperform a strong monolithic judge. We propose CriterAlign, a criterion-centric framework that adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. We further introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, and injected into the criterion generator, criterion judge, and final judge. On BigCodeReward, CriterAlign improves a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3% accuracy, with ablations confirming the contributions of pairwise criterion design and HPAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CriterAlign, a criterion-centric framework for pairwise code preference judging in code generation evaluation. It replaces pointwise rubric scoring with direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. The authors introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge outputs, then injected into the criterion generator, judge, and synthesis stage. On BigCodeReward, the method lifts accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3%, with ablations attributing gains to the pairwise design and HPAG.

Significance. If the reported gains hold under proper generalization checks, the work could meaningfully advance interpretable LLM judges for code preferences by aligning evaluation more closely with human rationales at the criterion level. The explicit ablations and the offline HPAG construction are strengths that allow component-wise assessment. The result is potentially impactful for preference modeling in code generation, provided the guidance does not overfit to the training distribution.

major comments (2)
  1. [§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.
  2. [§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.
minor comments (2)
  1. [§3.2] The description of swap-consistency filtering would benefit from a quantitative breakdown (e.g., fraction of pairs filtered and its effect on final accuracy) to clarify its contribution beyond the qualitative motivation.
  2. [§3.4] Notation for the final pairwise synthesis step could be made more explicit; a small equation or pseudocode block would help readers trace how criterion-level judgments are aggregated into the overall preference decision.
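Both minor comments ask for something that is easy to make concrete. The sketch below shows one way the requested breakdown could be computed: the fraction of criteria dropped by swap-consistency filtering, a tally of the surviving verdicts, and a plain-text evidence list for a final judge. The data layout (a dict from criterion text to "A", "B", or "tie") and the function names are assumptions for illustration. The tally is only a summary; the page describes the final decision as a synthesis step performed by a final judge, not a majority vote.

```python
from collections import Counter
from typing import Dict, List


def breakdown(num_criteria: int, kept: Dict[str, str]) -> Dict[str, object]:
    """Summarize swap filtering: fraction dropped and tally of surviving verdicts."""
    if num_criteria <= 0 or len(kept) > num_criteria:
        raise ValueError("num_criteria must be positive and at least len(kept)")
    tally = Counter(kept.values())
    return {
        "filtered_fraction": (num_criteria - len(kept)) / num_criteria,
        "tally": {label: tally.get(label, 0) for label in ("A", "B", "tie")},
    }


def evidence_lines(kept: Dict[str, str]) -> List[str]:
    """Render surviving criterion verdicts as text evidence for a final judge."""
    return [f"- {criterion}: {verdict}" for criterion, verdict in kept.items()]


# Hypothetical example: 4 criteria were generated, 3 survived filtering.
kept_example = {"meets the request": "A", "handles empty input": "B", "respects constraints": "A"}
report = breakdown(4, kept_example)
print(report)
print("\n".join(evidence_lines(kept_example)))
```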

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, which identifies key areas for strengthening claims about generalization and statistical robustness. We respond to each major comment below and indicate the revisions we will make to the manuscript.

  1. Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.

    Authors: We appreciate the referee highlighting the risk that HPAG may capture dataset-specific artifacts. The construction process in §3.3 extracts recurring rationale gaps between human preferences and monolithic judge outputs across the training examples, with the intent of identifying systematic patterns rather than instance-specific noise. Existing ablations in §4.2 already isolate HPAG by removing it while retaining the same training data and other components, showing a performance drop that supports its contribution. Nevertheless, we acknowledge the absence of cross-dataset testing. In the revised manuscript we will add evaluations on HumanEval and CodeContests using the same CriterAlign pipeline (with HPAG extracted only from BigCodeReward training data) to provide direct evidence of generalization beyond the original benchmark. revision: yes

  2. Referee: [§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.

    Authors: We agree that reporting variance and statistical measures would improve confidence in the results. The headline numbers and ablations were obtained from single runs, primarily due to the substantial computational cost of repeated inference with 32B-scale models. The ablation table does demonstrate consistent directional gains when components are added or removed. In the revised version we will rerun the key configurations (monolithic baseline, full CriterAlign, and the two main ablations) across three random seeds on the BigCodeReward test split and report means with standard deviations or confidence intervals. revision: yes
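The reporting promised in this response can be sketched with the standard library. The sketch makes two assumptions that are not taken from the paper: the test-set size `n` is a hypothetical parameter, since this page does not state it, and the z statistic treats the two accuracies as independent samples. Both judges are scored on the same items, so a paired test such as McNemar's would be the more natural choice once per-item outcomes are available.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed accuracies (needs >= 2 runs)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def two_proportion_z(acc_base: float, acc_new: float, n: int) -> float:
    """Unpaired two-proportion z statistic for accuracies measured on n items each."""
    se = math.sqrt(acc_base * (1 - acc_base) / n + acc_new * (1 - acc_new) / n)
    return (acc_new - acc_base) / se


# Reported accuracies are from the paper; the sizes below are hypothetical.
for n in (200, 1000):
    z = two_proportion_z(0.604, 0.663, n)
    print(f"n={n}: z={z:.2f}, exceeds 1.96: {z > 1.96}")
```

Under these assumptions the same 5.9-point gap is well inside noise for a small test set and clears the conventional 1.96 threshold for a larger one, which is why the referee's request for the split size and error bars matters.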

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark gains remain independent of inputs.

The paper describes an empirical framework (CriterAlign) that applies direct criterion-level pairwise judgments, tie-driven refinement, swap-consistency filtering, final synthesis, and offline-synthesized HPAG guidance extracted from training examples. The central result is a reported accuracy lift on the external BigCodeReward benchmark (60.4% to 66.3%). No equations, derivations, or self-referential definitions are present that would reduce any claimed prediction or result to its own fitted inputs by construction. HPAG synthesis is a one-time offline extraction step whose output is then used as guidance; this does not make the downstream accuracy score tautological or equivalent to the extraction process itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described chain. The improvement is therefore presented as a self-contained empirical outcome against an external benchmark rather than a restatement of the method's own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that LLMs can reliably perform criterion-wise pairwise comparisons and that recurring rationale gaps extracted from a training set provide useful guidance without introducing bias; no free parameters or invented physical entities are mentioned.

axioms (1)
  • domain assumption: LLM judges can be improved by injecting human-preference rationale gaps extracted offline from training examples
    Central to the HPAG component described in the abstract
invented entities (1)
  • HPAG (Human-Preference-Aligned Guidance): no independent evidence
    purpose: Synthesized guidance injected into criterion generator, criterion judge, and final judge to align with human preferences
    Extracted from training examples where human preferences differ from monolithic judge predictions
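The first stage of the extraction described here, finding training cases where the human preference and the monolithic judge disagree, can be sketched as a simple filter. The record fields (`task_id`, `category`, `human`, `judge`) are hypothetical names chosen for illustration. The later stage that turns these cases into written guidance is not shown, because its details are not given on this page.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class Example(TypedDict):
    task_id: str
    category: str  # hypothetical task-category field
    human: str     # human preference: "A", "B", or "tie"
    judge: str     # monolithic judge prediction, same label space


def disagreement_cases(train: List[Example]) -> Dict[str, List[str]]:
    """Group the IDs of training examples where judge and human disagree, by category."""
    by_category: Dict[str, List[str]] = defaultdict(list)
    for ex in train:
        if ex["human"] != ex["judge"]:
            by_category[ex["category"]].append(ex["task_id"])
    return dict(by_category)


# Made-up records, only to show the shape of the input and output.
train_split: List[Example] = [
    {"task_id": "t1", "category": "web", "human": "A", "judge": "B"},
    {"task_id": "t2", "category": "web", "human": "B", "judge": "B"},
    {"task_id": "t3", "category": "game", "human": "tie", "judge": "A"},
]
gaps = disagreement_cases(train_split)
print(gaps)
```

Keeping this filter restricted to the training split is what separates guidance synthesis from test-set leakage, which is the concern raised in the first major comment.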

pith-pipeline@v0.9.0 · 5729 in / 1400 out tokens · 36333 ms · 2026-05-20T04:29:38.145821+00:00 · methodology


