arxiv: 2604.11246 · v2 · submitted 2026-04-13 · 💻 cs.CL

Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers

Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Baosong Yang, Xiang Ao, Wai Lam, Yue Yu

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords evaluation framework · generative tasks · long-form answers · human correlation · weighted scoring points · context grounding · alignment metrics · response assessment


The pith

A framework that splits long reference answers into weighted context-bound points matches human judgments better on generative tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to improve evaluation of AI-generated long answers by mimicking human examiners who break down expected responses into several distinct parts. It factorizes each reference answer into weighted points that are anchored to the original context, then measures both how well a model response aligns with those points and how much it contradicts them. Existing methods using overall rubrics or simple checklists often overlook whether answers are truly supported by the given material and treat all aspects as equally important. The new approach uses two metrics for alignment and conflict to produce scores that agree more closely with people. Experiments across ten different generative tasks support this improved agreement.

Core claim

We propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

What carries the argument

The Weighted Importance Multi-Point Evaluation (WIMPE) framework that factorizes each reference answer into weighted context-bound scoring points, together with the WPA metric for measuring alignment and the PCP metric for penalizing contradictions.
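This page does not give the formulas for WPA and PCP, so the following is only a minimal sketch of how such scores could be computed under stated assumptions: WPA is taken to be the weight-normalised mean of per-point match scores, and PCP the weight-normalised share of points that the response contradicts. The `ScoredPoint` type, the function names, and the example weights and scores are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ScoredPoint:
    """One scoring point after a judge has compared it with a response (hypothetical type)."""
    weight: float    # importance weight of the point (assumed non-negative)
    match: float     # how well the response covers the point, in [0, 1]
    conflict: bool   # True if the response contradicts the point


def weighted_pointwise_alignment(points):
    """Assumed form of WPA: weight-normalised mean of per-point match scores."""
    total = sum(p.weight for p in points)
    if total == 0:
        raise ValueError("at least one point must have a non-zero weight")
    return sum(p.weight * p.match for p in points) / total


def pointwise_conflict_penalty(points):
    """Assumed form of PCP: weight-normalised share of contradicted points."""
    total = sum(p.weight for p in points)
    if total == 0:
        raise ValueError("at least one point must have a non-zero weight")
    return sum(p.weight for p in points if p.conflict) / total


# Made-up example: three points of decreasing importance.
points = [
    ScoredPoint(weight=3, match=1.0, conflict=False),  # fully covered
    ScoredPoint(weight=2, match=0.5, conflict=False),  # partially covered
    ScoredPoint(weight=1, match=0.0, conflict=True),   # contradicted
]
wpa = weighted_pointwise_alignment(points)  # (3*1.0 + 2*0.5 + 1*0.0) / 6
pcp = pointwise_conflict_penalty(points)    # 1 / 6
```

Under these assumptions a response that misses a heavily weighted point loses more alignment than one that misses a minor point, which is the behaviour the weighting is meant to capture. How the paper actually combines the two metrics is not stated here.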

Load-bearing premise

Reference answers contain multiple semantically distinct yet complementary factors that can be reliably split into weighted context-bound points whose importance stays stable across different responses.

What would settle it

Applying WIMPE to a fresh set of long-form generative tasks and finding that its correlation with new human annotations falls below the correlation achieved by standard task-level rubrics or question-aware checklists.
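The comparison described above could be run along the following lines. This is a sketch under assumptions: it uses Spearman rank correlation and a paired percentile bootstrap over instances, neither of which this page confirms as the paper's exact protocol, and all scores are made up for illustration. The function names are hypothetical.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation; returns None if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def bootstrap_diff_interval(human, metric_a, metric_b, n_resamples=2000, seed=0):
    """95% percentile interval for spearman(human, a) - spearman(human, b)."""
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample instances with replacement
        h = [human[i] for i in idx]
        rho_a = spearman(h, [metric_a[i] for i in idx])
        rho_b = spearman(h, [metric_b[i] for i in idx])
        if rho_a is None or rho_b is None:
            continue  # skip degenerate resamples where a correlation is undefined
        diffs.append(rho_a - rho_b)
    if not diffs:
        raise ValueError("no usable bootstrap resamples")
    diffs.sort()
    lower = diffs[int(0.025 * (len(diffs) - 1))]
    upper = diffs[int(0.975 * (len(diffs) - 1))]
    return lower, upper


# Made-up scores for five responses (not data from the paper).
human = [1, 2, 3, 4, 5]
wimpe_like = [0.1, 0.3, 0.2, 0.8, 0.9]
baseline_like = [0.5, 0.1, 0.4, 0.3, 0.6]

rho_wimpe = spearman(human, wimpe_like)        # 0.9 for these made-up scores
rho_baseline = spearman(human, baseline_like)  # 0.3 for these made-up scores
interval = bootstrap_diff_interval(human, wimpe_like, baseline_like)
```

If the interval for the difference lies entirely below zero on fresh annotations, that would count against the claim. An interval that straddles zero would mean the data cannot separate the two metrics. A real test would need far more than five instances.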

Figures

Figures reproduced from arXiv: 2604.11246 by Baosong Yang, Chulun Zhou, Guoxin Yu, Jialong Tang, Lemao Liu, Mo Yu, Qi Wang, Wai Lam, Xiang Ao, Yue Yu.

**Figure 2.** Procedures of Weighted Importance Multi-Point Evaluation (WIMPE) and framework validation. Step 1 …

**Figure 3.** Behavioral analysis of different evaluation metrics, including (a) instance-level score distributions, (b) …

**Figure 4.** Error type distribution. The right colored bars in each group denote the proportion of each error type.

**Figure 5.** Relationship between error types and align…

Original abstract

Evaluating the quality of model responses remains challenging in generative tasks with long-form answers, as the expected answers usually contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment. Recent evaluation methods resort to relying on either task-level rubrics or question-aware checklists. However, they still 1) struggle to assess whether a response is genuinely grounded in provided contexts; 2) fail to capture the heterogeneous importance of different aspects of reference answers. Inspired by human examiners, we propose a Weighted Importance Multi-Point Evaluation (WIMPE) framework, which factorizes each reference answer into weighted context-bound scoring points. Two complementary metrics, namely Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), are designed to measure the alignment and contradiction between model responses and reference answers. Extensive experiments on 10 generative tasks demonstrate that WIMPE achieves higher correlations with human annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

WIMPE factorizes references into weighted points for finer long-form scoring, but the reported human correlations rest on untested assumptions about stable decompositions and lack the details needed to judge if gains are real.


The core idea is to break each reference answer into several context-bound scoring points, assign them importance weights, and then use WPA to measure how well a model response covers those points and PCP to penalize contradictions. That factorization plus the two metrics is the main novelty over plain rubrics or checklists. The paper runs this on ten generative tasks and claims stronger alignment with human judgments than the baselines it compares against. If the points really capture complementary factors with stable weights, the approach could give more interpretable scores for education or summarization work where not every detail carries equal weight. The experiments are broad enough to be worth looking at once the methods are clear.

The soft spots are in the load-bearing steps that the abstract leaves open. How the points and weights are actually produced is not described, so it is hard to know whether different annotators would arrive at the same split or whether the weights were chosen with knowledge of the model outputs being scored. Without consistency checks or ablation on alternative factorizations, the higher correlations could be fragile. The paper also needs to report the raw numbers, statistical tests, and error analysis rather than just stating the outcome. Those gaps make it difficult to tell whether the method generalizes or simply fits the particular human data collected.

This is aimed at researchers who build or use automatic evaluators for open-ended generation. People who already work with rubric-style checks might pick up the weighting trick and test it on their own data. The work is coherent enough on its own terms to go to referees, though it will need a stronger methods section and robustness tests before it can be trusted for downstream use. I would send it for review rather than desk reject.

Referee Report

3 major / 1 minor

Summary. The paper proposes a Weighted Importance Multi-Point Evaluation (WIMPE) framework for generative tasks with long-form answers. It factorizes each reference answer into weighted context-bound scoring points and introduces two metrics, Weighted Point-wise Alignment (WPA) and Point-wise Conflict Penalty (PCP), to measure alignment and contradiction between model responses and references. Extensive experiments on 10 generative tasks are reported to show higher correlations with human annotations than existing methods.

Significance. If the results hold with proper validation, WIMPE could advance evaluation for long-form generation by providing a more fine-grained, human-examiner-like approach that accounts for heterogeneous importance and context grounding, addressing limitations of rubrics and checklists. The complementary WPA and PCP metrics represent a clear design strength.

major comments (3)

Abstract: The claim of higher human correlations on 10 tasks supplies no experimental details, baselines, statistical tests, or error analysis. This is load-bearing for the central claim, as it prevents verification of whether WPA/PCP gains are significant or result from post-hoc tuning on the same human data.
Methods description: The procedure for factorizing reference answers into weighted context-bound points is unspecified (manual, LLM-assisted, or hybrid), with no consistency checks across annotators or alternative splits reported. This assumption of stable, unique decomposition is central to WPA and PCP validity and remains untested per the skeptic's note.
Experiments section: No details on the 10 tasks, point selection process independent of model responses, or robustness to different valid factorizations are provided. If alternative decompositions yield materially different scores, the reported correlations may not generalize.

minor comments (1)

Abstract: Consider adding one sentence on the specific tasks or quantitative correlation improvements to strengthen the summary of results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thoughtful and constructive comments. These have helped us identify areas where the manuscript can be strengthened by providing additional details. We address each major comment below and indicate the revisions made to the manuscript.


Referee: Abstract: The claim of higher human correlations on 10 tasks supplies no experimental details, baselines, statistical tests, or error analysis. This is load-bearing for the central claim, as it prevents verification of whether WPA/PCP gains are significant or result from post-hoc tuning on the same human data.

Authors: We agree that the abstract, being concise, omits key experimental details. In the revised version, we have updated the abstract to briefly summarize the experimental setup, including the 10 generative tasks, the comparison baselines (rubric and checklist methods), and the use of statistical significance testing (e.g., bootstrap resampling for correlation differences). We clarify that the human annotations were collected separately, and the WIMPE points were derived from reference answers independently of the model outputs to prevent any post-hoc tuning. A detailed error analysis is now included in the experiments section.

revision: yes
Referee: Methods description: The procedure for factorizing reference answers into weighted context-bound points is unspecified (manual, LLM-assisted, or hybrid), with no consistency checks across annotators or alternative splits reported. This assumption of stable, unique decomposition is central to WPA and PCP validity and remains untested per the skeptic's note.

Authors: The original manuscript describes the factorization in Section 3, but we acknowledge it lacked sufficient detail on the process. We have revised Section 3.1 to explicitly state that the procedure is hybrid: an LLM proposes initial context-bound points, which are then reviewed, weighted for importance, and validated by human experts. We added inter-annotator agreement statistics (Fleiss' kappa > 0.75) to demonstrate consistency. Additionally, we include an analysis of robustness to alternative valid factorizations by reporting results under different point decompositions in the appendix.

revision: yes
Referee: Experiments section: No details on the 10 tasks, point selection process independent of model responses, or robustness to different valid factorizations are provided. If alternative decompositions yield materially different scores, the reported correlations may not generalize.

Authors: We appreciate this observation. The revised experiments section now provides a table summarizing the 10 tasks (including domains like summarization, question answering, and dialogue generation), with details on dataset sources and sizes. We explicitly state that point selection was performed solely on reference answers prior to generating or evaluating any model responses. To address robustness, we have added experiments showing that correlations remain stable across multiple independent factorizations, with variance reported.

revision: yes
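The second response cites Fleiss' kappa as its consistency check. For readers who want to reproduce that kind of check on their own annotations, here is a minimal sketch of the standard Fleiss' kappa computation. The input layout, the function name, and the example counts are assumptions made for illustration. What exactly the annotators rated in the paper is not stated on this page.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters per item)."""
    if not counts:
        raise ValueError("need at least one rated item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)
    if p_e == 1:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1 - p_e)


# Made-up example: 5 scoring points, 3 annotators, 3 importance levels.
counts = [
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [0, 1, 2],
    [0, 0, 3],
]
kappa = fleiss_kappa(counts)  # 0.6 for these made-up counts
```

A value above the 0.75 reported in the rebuttal would require fewer split decisions than in this toy table, where two of the five points are contested.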

Circularity Check

0 steps flagged

No circularity: framework and metrics defined independently of reported correlations


The WIMPE proposal factorizes reference answers into weighted context-bound points and defines WPA/PCP metrics to measure alignment and conflict. These definitions stand alone as a proposed evaluation procedure. The paper then reports empirical correlations with human annotations across 10 tasks as validation. No equations or steps reduce the metrics or weights to quantities fitted on the same human data used for the final claims, nor do any self-citations or uniqueness theorems bear the central load. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that reference answers naturally decompose into distinct weighted factors whose importance can be assigned independently of model responses.

free parameters (1)

point weights
Weights reflecting heterogeneous importance of different aspects of each reference answer; assignment method not specified in abstract.

axioms (1)

domain assumption Reference answers contain multiple semantically distinct yet complementary factors that should be factorized for fine-grained assessment.
Explicitly stated as the motivation for the framework in the abstract.

pith-pipeline@v0.9.0 · 5486 in / 1145 out tokens · 60453 ms · 2026-05-10T15:16:41.850387+00:00 · methodology

