CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging
Pith reviewed 2026-05-20 04:29 UTC · model grok-4.3
The pith
CriterAlign improves code preference judging accuracy from 60.4 percent to 66.3 percent by shifting to direct pairwise comparisons on each criterion plus guidance extracted from human preference gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CriterAlign adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. It further introduces Human-Preference-Aligned Guidance synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, then injects this guidance into the criterion generator, criterion judge, and final judge. On BigCodeReward this raises accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4 percent to 66.3 percent.
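As a rough illustration of swap-consistency filtering at the criterion level, the sketch below keeps a criterion's verdict only when judging (A, B) and (B, A) agree once the labels are mapped back to the original order. The `judge` callable, the label set `"A"`/`"B"`/`"tie"`, and the drop-on-disagreement policy are assumptions made for illustration; the paper's exact filter may differ.

```python
from typing import Callable, Dict, List

Verdict = str  # one of "A", "B", "tie" (assumed label set)

# Maps a verdict produced in swapped order back to the original labelling.
_UNSWAP = {"A": "B", "B": "A", "tie": "tie"}


def swap_consistent_verdicts(
    criteria: List[str],
    solution_a: str,
    solution_b: str,
    judge: Callable[[str, str, str], Verdict],
) -> Dict[str, Verdict]:
    """Return verdicts only for criteria judged the same way in both orders."""
    kept: Dict[str, Verdict] = {}
    for criterion in criteria:
        forward = judge(criterion, solution_a, solution_b)
        # The second call presents the two solutions in reversed positions.
        backward = _UNSWAP[judge(criterion, solution_b, solution_a)]
        if forward == backward:
            kept[criterion] = forward  # consistent, so keep it as evidence
        # Inconsistent verdicts are dropped (hypothetical policy).
    return kept
```

A judge that always answers `"A"` regardless of position would have every verdict dropped here, which is the position bias this kind of filter is meant to catch.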
What carries the argument
CriterAlign, a criterion-centric framework that replaces pointwise scoring with direct pairwise judgments at the criterion level and augments the process with offline-synthesized Human-Preference-Aligned Guidance drawn from human-judge rationale gaps.
Load-bearing premise
The pairwise criterion judgments and offline-extracted guidance will generalize to new code tasks and human preference distributions without overfitting to the particular examples used for guidance synthesis.
What would settle it
Running the complete CriterAlign pipeline on a fresh code preference dataset drawn from different programming tasks and annotator pools and checking whether accuracy still exceeds the monolithic baseline by a similar margin.
Original abstract
Pairwise human preference prediction is central to evaluating code-generation systems, where quality often depends on task-specific trade-offs beyond functional correctness. While rubric-based LLM judges improve interpretability by decomposing evaluation into explicit criteria, most existing pipelines remain pointwise: they score each response independently and derive preferences by comparing aggregated scores. We show that this design is poorly matched to pairwise code preference prediction and can underperform a strong monolithic judge. We propose CriterAlign, a criterion-centric framework that adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. We further introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, and injected into the criterion generator, criterion judge, and final judge. On BigCodeReward, CriterAlign improves a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3% accuracy, with ablations confirming the contributions of pairwise criterion design and HPAG.
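To make the pointwise design described in the abstract concrete, here is a minimal sketch of deriving a preference by scoring each response independently and then comparing aggregated scores. The function name, the integer per-criterion scores, and the unweighted sum are illustrative assumptions, not the paper's baseline implementation.

```python
from typing import Dict


def pointwise_preference(scores_a: Dict[str, int], scores_b: Dict[str, int]) -> str:
    """Derive a preference from independently assigned rubric scores."""
    # Each response was scored without reference to the other one.
    total_a = sum(scores_a.values())
    total_b = sum(scores_b.values())
    if total_a == total_b:
        return "tie"
    return "A" if total_a > total_b else "B"


# Toy scores, made up for illustration: both responses clear the same bar
# on every criterion, so the aggregate cannot separate them.
print(pointwise_preference({"correctness": 4, "edge_cases": 3},
                           {"correctness": 4, "edge_cases": 3}))  # tie
```

A direct pairwise judgment on each criterion instead asks which response is better on that criterion, so in principle it can register a difference even when both responses would receive the same coarse score.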
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CriterAlign, a criterion-centric framework for pairwise code preference judging in code generation evaluation. It replaces pointwise rubric scoring with direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. The authors introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge outputs, then injected into the criterion generator, judge, and synthesis stage. On BigCodeReward, the method lifts accuracy of a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3%, with ablations attributing gains to the pairwise design and HPAG.
Significance. If the reported gains hold under proper generalization checks, the work could meaningfully advance interpretable LLM judges for code preferences by aligning evaluation more closely with human rationales at the criterion level. The explicit ablations and the offline HPAG construction are strengths that allow component-wise assessment. The result is potentially impactful for preference modeling in code generation, provided the guidance does not overfit to the training distribution.
major comments (2)
- [§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.
- [§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.
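One possible form of the significance test asked for in the second major comment is an exact McNemar test on the paired per-example correctness of two judges scored on the same test examples. The sketch below is an assumption about how such a check might be run; the function name and inputs are hypothetical and nothing in it comes from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact_p(baseline_correct: Sequence[bool],
                    system_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two judges on the same examples."""
    if len(baseline_correct) != len(system_correct):
        raise ValueError("both judges must be scored on the same examples")
    pairs = list(zip(baseline_correct, system_correct))
    # Only discordant pairs carry information about which judge is better.
    only_baseline = sum(1 for b, s in pairs if b and not s)
    only_system = sum(1 for b, s in pairs if s and not b)
    n = only_baseline + only_system
    if n == 0:
        return 1.0
    k = min(only_baseline, only_system)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test uses only the examples where exactly one judge is correct, it needs per-example predictions rather than the two headline accuracies alone.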
minor comments (2)
- [§3.2] The description of swap-consistency filtering would benefit from a quantitative breakdown (e.g., fraction of pairs filtered and its effect on final accuracy) to clarify its contribution beyond the qualitative motivation.
- [§3.4] Notation for the final pairwise synthesis step could be made more explicit; a small equation or pseudocode block would help readers trace how criterion-level judgments are aggregated into the overall preference decision.
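The quantitative breakdown requested in the first minor comment could be computed along the following lines. This is a sketch under assumptions: the per-example log format and its three boolean field names are hypothetical, and whether filtering is counted per pair or per criterion is a choice the report leaves open.

```python
from typing import Dict, List, Tuple


def filter_breakdown(records: List[Dict[str, bool]]) -> Tuple[float, float, float]:
    """Summarize how swap-consistency filtering changes coverage and accuracy.

    Each record is a hypothetical per-example log with three boolean fields:
    "filtered" (the filter removed at least one judgment for this pair),
    "correct_with_filter", and "correct_without_filter".
    """
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    fraction_filtered = sum(r["filtered"] for r in records) / n
    acc_with = sum(r["correct_with_filter"] for r in records) / n
    acc_without = sum(r["correct_without_filter"] for r in records) / n
    return fraction_filtered, acc_with, acc_without
```

Reporting the three numbers side by side would show both how often the filter fires and whether final accuracy moves when it is switched off.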
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback, which identifies key areas for strengthening claims about generalization and statistical robustness. We respond to each major comment below and indicate the revisions we will make to the manuscript.
- Referee: [§4 and §3.3] §4 (Experiments) and §3.3 (HPAG construction): The 5.9-point accuracy lift on BigCodeReward rests on HPAG extracted from the same training examples used to tune the system. No held-out preference distribution, cross-dataset test (e.g., on HumanEval or CodeContests), or ablation that removes HPAG while keeping the training set fixed is reported. Without such evidence, it remains possible that the extracted guidance captures benchmark-specific labeler artifacts rather than generalizable rationale gaps, undermining the central claim that the criterion-centric architecture plus HPAG produces robust improvement.
  Authors: We appreciate the referee highlighting the risk that HPAG may capture dataset-specific artifacts. The construction process in §3.3 extracts recurring rationale gaps between human preferences and monolithic judge outputs across the training examples, with the intent of identifying systematic patterns rather than instance-specific noise. Existing ablations in §4.2 already isolate HPAG by removing it while retaining the same training data and other components, showing a performance drop that supports its contribution. Nevertheless, we acknowledge the absence of cross-dataset testing. In the revised manuscript we will add evaluations on HumanEval and CodeContests using the same CriterAlign pipeline (with HPAG extracted only from BigCodeReward training data) to provide direct evidence of generalization beyond the original benchmark.
  Revision: yes
- Referee: [§4.2] §4.2 (Ablation table): The ablations isolate pairwise criterion design and HPAG, yet the table does not report variance across multiple random seeds or dataset splits. Given that the headline result is a single-point accuracy comparison (60.4% → 66.3%), the absence of error bars or statistical significance tests makes it difficult to determine whether the observed lift is reliable or sensitive to the particular train/test partition of BigCodeReward.
  Authors: We agree that reporting variance and statistical measures would improve confidence in the results. The headline numbers and ablations were obtained from single runs, primarily due to the substantial computational cost of repeated inference with 32B-scale models. The ablation table does demonstrate consistent directional gains when components are added or removed. In the revised version we will rerun the key configurations (monolithic baseline, full CriterAlign, and the two main ablations) across three random seeds on the BigCodeReward test split and report means with standard deviations or confidence intervals.
  Revision: yes
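The multi-seed reporting promised in this response could be summarized as in the sketch below, which maps each configuration to its mean accuracy and sample standard deviation across seeds. The function name and the dictionary layout are hypothetical, and no accuracy values are filled in because the reruns are not yet reported.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize_runs(acc_by_config: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each configuration to (mean, sample standard deviation) over seeds."""
    summary: Dict[str, Tuple[float, float]] = {}
    for name, accs in acc_by_config.items():
        if len(accs) < 2:
            # stdev() is undefined for a single run.
            raise ValueError(f"{name}: need at least two seeds")
        summary[name] = (mean(accs), stdev(accs))
    return summary
```

With three seeds per configuration, `summarize_runs` would yield one (mean, standard deviation) pair each for the monolithic baseline, full CriterAlign, and the two main ablations.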
Circularity Check
No significant circularity; empirical benchmark gains remain independent of inputs.
The paper describes an empirical framework (CriterAlign) that applies direct criterion-level pairwise judgments, tie-driven refinement, swap-consistency filtering, final synthesis, and offline-synthesized HPAG guidance extracted from training examples. The central result is a reported accuracy lift on the external BigCodeReward benchmark (60.4% to 66.3%). No equations, derivations, or self-referential definitions are present that would reduce any claimed prediction or result to its own fitted inputs by construction. HPAG synthesis is a one-time offline extraction step whose output is then used as guidance; this does not make the downstream accuracy score tautological or equivalent to the extraction process itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described chain. The improvement is therefore presented as a self-contained empirical outcome against an external benchmark rather than a restatement of the method's own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM judges can be improved by injecting human-preference rationale gaps extracted offline from training examples.
invented entities (1)
- HPAG (Human-Preference-Aligned Guidance): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · tag: unclear
  Paper passage: direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis plus HPAG
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · tag: unclear
  Paper passage: Human-Preference-Aligned Guidance (HPAG) synthesized offline from training examples by extracting recurring rationale gaps
Each tag describes the relation between the paper passage and the cited Recognition theorem.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.