RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

Arman Cohan; Manasi Patwardhan; Owen Jiang; Sihong Wu; Tiansheng Hu; Yiling Ma; Yilun Zhao

arXiv: 2603.09723 · v2 · submitted 2026-03-10 · cs.CL · cs.AI

Pith reviewed 2026-05-15 13:17 UTC · model grok-4.3

keywords: peer review · actionable feedback · rebuttal supervision · LLM fine-tuning · feedback generation · preference optimization · review segments

The pith

Rebuttals can supervise LLMs to generate more actionable peer review feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Peer review feedback from AI often lacks concrete guidance that authors can implement. The paper proposes using existing rebuttals, where authors respond to specific review points, as implicit supervision to train better feedback generators. It introduces a task where the model generates a single focused comment from a paper based on a perspective like experiments or writing. A dataset of 75,000 review-rebuttal mappings is built to support supervised fine-tuning and preference optimization on Llama-3.1-8B. Human and LLM evaluations show improvements in actionability and specificity while keeping comments grounded.

Core claim

By treating rebuttals as supervision signals that indicate which review segments prompted concrete author responses, the RbtAct method optimizes review feedback generation for actionability through perspective-conditioned segment-level tasks and preference pairs derived from author uptake in rebuttals.

What carries the argument

Rebuttal-derived preference pairs that map review segments to rebuttal segments addressing them, with impact categories that order author uptake for preference optimization.
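
One way to picture these rebuttal-derived pairs is sketched below. It is a minimal illustration, not the authors' pipeline: the impact category names and ranks in `IMPACT_RANK`, the `MappedSegment` record, and the choice to pair segments only within the same paper and perspective are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical impact categories, ranked from least to most author uptake.
# The page does not list the paper's actual category names.
IMPACT_RANK = {"defended": 0, "clarified": 1, "planned_revision": 2, "revised": 3}


@dataclass(frozen=True)
class MappedSegment:
    """A review segment plus the impact category of the rebuttal that answers it."""
    paper_id: str
    perspective: str
    review_text: str
    impact: str


def build_preference_pairs(segments, min_gap=1):
    """Pair segments from the same paper and perspective.

    The segment whose rebuttal shows more uptake becomes `chosen`;
    pairs whose rank gap is below `min_gap` are skipped.
    """
    groups = {}
    for seg in segments:
        groups.setdefault((seg.paper_id, seg.perspective), []).append(seg)

    pairs = []
    for (paper_id, perspective), group in groups.items():
        for a, b in combinations(group, 2):
            gap = IMPACT_RANK[a.impact] - IMPACT_RANK[b.impact]
            if abs(gap) < min_gap:
                continue  # equal or near-equal uptake gives no preference signal
            chosen, rejected = (a, b) if gap > 0 else (b, a)
            pairs.append({
                "paper_id": paper_id,
                "perspective": perspective,
                "chosen": chosen.review_text,
                "rejected": rejected.review_text,
            })
    return pairs
```

Under these assumptions, two comments that both led to revisions produce no pair, while a revised comment paired with a merely defended one yields a clear chosen/rejected example.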

If this is right

Feedback generators produce more specific and implementable comments.
The perspective-conditioned task focuses output on one aspect of the paper at a time.
Models maintain grounding and relevance to the original paper.
Gains appear consistently in both human expert ratings and LLM-as-a-judge evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Trained models could supply initial drafts that human reviewers edit rather than write from scratch.
The same supervision signal might improve feedback generation in grant proposals or code reviews.
The RMR-75K mappings could become a public benchmark for measuring actionability in review systems.
Repeated application might shorten revision cycles by giving authors clearer next steps earlier.

Load-bearing premise

Rebuttals reliably indicate which review segments were actionable because authors only engage with feedback concrete enough to respond to or implement.

What would settle it

A blind human evaluation in which experts rate feedback from the trained model as no more actionable than from strong baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.09723 by Arman Cohan, Manasi Patwardhan, Owen Jiang, Sihong Wu, Tiansheng Hu, Yiling Ma, Yilun Zhao.

**Figure 2.** Left: brief summary of perspective labels for review segments and impact categories for rebuttal segments. Right: normalized (100%) impact category composition by perspective. Review Perspective Labels. Each mapped review segment receives one label from a 7-category taxonomy: Experiments, Writing, Presentation, Theory, Novelty, Reproducibility, and Evaluation. We assign labels automatically with a rubric-…

**Figure 3.** A mapping example of our RMR-75K dataset. The review and rebuttal are from the paper titled “Large …

**Figure 4.** Prompt used to segment the weaknesses and questions parts of the review into segments.

**Figure 5.** Prompt used for mapping review segments with rebuttal segments.

**Figure 6.** Prompt used to label which perspective a review segment belongs to.

**Figure 7.** Prompt used to label which impact category a rebuttal segment belongs to.

**Figure 8.** Prompt used to generate review segments by different LLM baselines.

**Figure 9.** The interface of our human expert evaluation. The page contains 2 tasks: pairwise preference and …

**Figure 10.** Comparison guidelines for the “Actionability” criterion.

**Figure 11.** Comparison guidelines for the “Specificity” criterion.

**Figure 12.** Comparison guidelines for the “Groundedness” criterion.

**Figure 13.** Comparison guidelines for the “Relevance” criterion.

**Figure 14.** Comparison guidelines for the “Helpfulness” criterion.

**Figure 15.** Prompt used for point-wise evaluation for LLM-as-a-judge.

**Figure 16.** Prompt used for pairwise evaluation for LLM-as-a-judge on Actionability.

**Figure 17.** Pairwise win rates heatmaps by perspective (row beats column) of § …

**Figure 18.** Case study comparing review feedback on Actionability from Experiment, Presentation and Evaluation …
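
The "row beats column" win rates mentioned in the Figure 17 caption can be computed from pairwise judgments roughly as follows. This is a sketch under stated assumptions: the `(winner, loser)` input format, the dropping of ties, and the system names in the usage note are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def win_rate_matrix(judgments, systems):
    """Build matrix[row][col] = share of row-vs-col comparisons won by row.

    `judgments` is an iterable of (winner, loser) system names; ties are
    assumed to have been dropped beforehand. Cells with no comparisons,
    including the diagonal, are None.
    """
    wins = defaultdict(int)
    for winner, loser in judgments:
        wins[(winner, loser)] += 1

    matrix = {}
    for row in systems:
        matrix[row] = {}
        for col in systems:
            if row == col:
                matrix[row][col] = None
                continue
            total = wins[(row, col)] + wins[(col, row)]
            matrix[row][col] = wins[(row, col)] / total if total else None
    return matrix
```

For example, three wins for a hypothetical `"tuned"` system against one win for `"base"` gives `matrix["tuned"]["base"] == 0.75` and `matrix["base"]["tuned"] == 0.25`.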

Original abstract

Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
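
The task described in the abstract takes a full paper and one perspective as input and produces a single review comment. A minimal sketch of how one supervised example might be assembled is shown below; the seven perspective names come from the Figure 2 caption, while the prompt wording, the `SftExample` record, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Perspective taxonomy as listed in the Figure 2 caption.
PERSPECTIVES = (
    "Experiments", "Writing", "Presentation", "Theory",
    "Novelty", "Reproducibility", "Evaluation",
)


@dataclass(frozen=True)
class SftExample:
    prompt: str   # model input: perspective plus full paper text
    target: str   # model output: one focused review segment


def build_prompt(paper_text: str, perspective: str) -> str:
    """Render a perspective-conditioned prompt (wording is illustrative only)."""
    if perspective not in PERSPECTIVES:
        raise ValueError(f"unknown perspective: {perspective!r}")
    return (
        f"Perspective: {perspective}\n"
        "Write one focused review comment on the paper below from this perspective.\n\n"
        f"Paper:\n{paper_text}"
    )


def make_sft_example(paper_text: str, perspective: str, review_segment: str) -> SftExample:
    """Pair the prompt with a human-written review segment as the target."""
    return SftExample(
        prompt=build_prompt(paper_text, perspective),
        target=review_segment.strip(),
    )
```

Rejecting unknown perspectives early keeps the conditioning signal restricted to the fixed taxonomy, which is the point of conditioning on a perspective at all.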

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a new dataset from review-rebuttal pairs and uses it to fine-tune a model for segment-level actionable feedback, but the rebuttal uptake proxy for actionability looks noisy.

The main point is that they treat rebuttals as direct supervision for training a feedback generator. They define a new task of perspective-conditioned segment-level review feedback generation, release the RMR-75K dataset that aligns review segments to rebuttal segments with impact labels, and train Llama-3.1-8B first with supervised fine-tuning then with preference optimization on rebuttal-derived pairs. Experiments report gains in actionability and specificity over baselines while keeping grounding and relevance, judged by both humans and LLM evaluators. The dataset construction is concrete and the use of existing rebuttal text as a training signal is a practical move that avoids needing new annotations from scratch.

The central weakness is the assumption that author uptake in a rebuttal reliably marks a comment as actionable. Rebuttals often include defenses, clarifications, or refusals to revise even for weak points, and strong suggestions can be ignored for non-technical reasons. This noise could mean the model learns to produce rebuttal-style text rather than genuinely implementable feedback, which would inflate the measured gains without confirming the supervision works as intended. The abstract gives no numbers or error analysis, so it is hard to tell how large or robust the improvements actually are.

This is mainly useful for people already working on AI tools for peer review or scientific writing in NLP. The dataset and task framing are solid enough to deserve a serious referee even if the evaluation of the proxy needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RbtAct, a framework that uses existing peer-review rebuttals as implicit supervision to train LLMs for generating actionable review feedback. It introduces the task of perspective-conditioned segment-level review feedback generation, constructs the RMR-75K dataset by mapping review segments to rebuttal segments with perspective labels and impact categories, and trains Llama-3.1-8B-Instruct first with supervised fine-tuning on review segments and then with preference optimization on rebuttal-derived pairs. Human-expert and LLM-as-a-judge evaluations are reported to show consistent gains in actionability and specificity over strong baselines while preserving grounding and relevance.

Significance. If the rebuttal-uptake proxy is reliable, the work offers a practical route to more implementable AI-generated reviews, a clear gap in current systems. The RMR-75K dataset is a substantial new resource, and the two-stage training pipeline (SFT + preference optimization) is straightforward to reproduce. Human evaluation adds credibility beyond automated metrics.

major comments (2)

[§3 (RMR-75K construction)] The supervision signal rests on the assumption that author uptake in rebuttals is a valid proxy for the original review segment being actionable and implementable. Rebuttals frequently contain methodological defenses, clarifications, or refusals to revise that do not reflect actionability; conversely, actionable suggestions may be ignored for non-technical reasons. This noise could cause the model to optimize for rebuttal-style text rather than truly concrete feedback, directly affecting the central claim.
[§5 (Experiments and evaluation)] The reported gains in actionability and specificity are presented without accompanying numerical tables, exact baseline descriptions, statistical significance tests, or error analysis. Without these details it is impossible to judge whether the improvements are robust or sensitive to post-hoc choices in prompt design or judge instructions.

minor comments (2)

The abstract would be strengthened by including one or two key quantitative results (e.g., absolute actionability scores or delta over baselines) so readers can immediately gauge effect size.
[§4 (Training)] Clarify in §4 how the preference pairs are exactly constructed from impact categories (e.g., which categories are treated as positive vs. negative) and whether any filtering is applied to noisy mappings.
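
To make the filtering question concrete: the mapping prompt excerpted on this page asks for a confidence score between 0 and 1 and a "No Response" marker for unaddressed points. One plausible filter over such mappings is sketched below. Whether the authors apply any filter of this kind is not stated here; the `Mapping` record, the use of `None` for "No Response", and the 0.8 threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    review_segment: str
    rebuttal_segment: Optional[str]  # None stands in for "No Response"
    confidence: float                # mapping confidence in [0, 1]


def filter_mappings(mappings, min_confidence=0.8):
    """Keep mappings that have a rebuttal response and confidence above the threshold."""
    kept = []
    for m in mappings:
        if m.rebuttal_segment is None:
            continue  # the rebuttal never addressed this review point
        if not 0.0 <= m.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {m.confidence}")
        if m.confidence > min_confidence:
            kept.append(m)
    return kept
```

A stricter threshold trades dataset size for cleaner supervision, which is the trade-off the referee's question is probing.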

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive major comments. We address each point below and will revise the manuscript to strengthen the presentation and discussion of limitations.

Referee: [§3 (RMR-75K construction)] The supervision signal rests on the assumption that author uptake in rebuttals is a valid proxy for the original review segment being actionable and implementable. Rebuttals frequently contain methodological defenses, clarifications, or refusals to revise that do not reflect actionability; conversely, actionable suggestions may be ignored for non-technical reasons. This noise could cause the model to optimize for rebuttal-style text rather than truly concrete feedback, directly affecting the central claim.

Authors: We agree that rebuttal uptake is an imperfect proxy and that noise from defenses, clarifications, and non-technical refusals is a genuine concern. Our dataset construction attempts to reduce this by (1) mapping only segments that are explicitly addressed in the rebuttal and (2) ordering pairs by impact categories that reflect degree of author uptake. Nevertheless, we acknowledge that this does not eliminate all noise. In the revision we will expand §3 with a dedicated limitations paragraph that enumerates the main sources of noise, quantifies the fraction of rebuttal segments that contain explicit revision commitments versus defenses, and explains why perspective conditioning still yields measurable gains in downstream actionability. We will also add a qualitative error analysis in §5 that inspects generated feedback for rebuttal-style phrasing.
Revision: partial

Referee: [§5 (Experiments and evaluation)] The reported gains in actionability and specificity are presented without accompanying numerical tables, exact baseline descriptions, statistical significance tests, or error analysis. Without these details it is impossible to judge whether the improvements are robust or sensitive to post-hoc choices in prompt design or judge instructions.

Authors: We apologize for the insufficient detail in the submitted version. The revised manuscript will include: (i) full numerical tables with all metrics, (ii) exact baseline prompts and model versions, (iii) statistical significance tests (paired bootstrap and McNemar tests with p-values), and (iv) a new error-analysis subsection that examines sensitivity to judge instructions and prompt variations. These additions will make the experimental claims fully reproducible and allow readers to assess robustness.
Revision: yes
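
The two tests named in the response can be sketched with the standard library alone. This is a generic illustration of an exact McNemar test and a paired bootstrap, not the authors' evaluation code; the function names, the resample count, and the fixed seed are choices made for this example.

```python
import math
import random


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items where only system A was preferred; c: items where only B was.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Share of resamples in which A's mean score does not exceed B's.

    Items are resampled with replacement, keeping each A/B pair together.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

McNemar's test suits paired binary outcomes such as pairwise wins, while the bootstrap suits graded scores such as 1-5 ratings on the same items.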

Circularity Check

0 steps flagged

No circularity: external rebuttal supervision is independent of model internals

The paper builds RMR-75K by mapping review segments to rebuttal segments drawn from external data, then applies standard SFT followed by preference optimization on rebuttal-derived pairs. The claimed gains in actionability are measured by separate human-expert and LLM-as-judge evaluations that do not reduce to any fitted parameter or self-defined quantity inside the generator. No equations, self-citations, or uniqueness theorems are invoked that would make the output equivalent to the input by construction. The supervision signal originates outside the model, satisfying the criterion for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption: that rebuttal text encodes reliable information about which review comments were actionable. No free parameters or invented entities are introduced beyond standard language-model training choices.

axioms (1)

Domain assumption: Rebuttals provide a valid implicit supervision signal for the actionability of review comments.
This assumption is invoked when the authors treat rebuttal segments as positive or negative examples for preference optimization.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

[1] Each point should be an independent, specific issue or weakness
[2] Preserve the core meaning of the original text without adding or removing information
[3] Maintain existing numbering structures if present (e.g., 1., 2., W1, W2, etc.)
[4] Handle various formatting styles including: - Numbered lists (1., 2., 3.) - Letter prefixes (W1, W2, Q1, Q2) - Markdown bullet points (-, *, +) - Section headers (## Weaknesses, ## Questions)
[5] If no clear numbering exists, logically segment based on content structure
[6] Each point should contain sufficient context to be understood independently
[7] We added/updated (2024)
    Preserve the original language and terminology used by the reviewer [User Prompt] Please segment the following Weaknesses & Questions text into independent points: {weaknesses&questions text} IMPORTANT: Regardless of the input format (bullet points, numbered lists, paragraphs, etc.), you MUST output in this exact format: Point 1: [Complete content of the ...
[8] Carefully analyze the rebuttal text to identify which sections respond to specific weaknesses
[9] Look for explicit references (W1, W2, Point 1, etc.) or implicit topical connections
[10] Extract the complete response content that addresses each weakness
[11] Assign confidence scores (0-1) based on the clarity and directness of the mapping
[12] Mark as "No Response" if a weakness is not addressed in the rebuttal
[13] Be conservative with confidence scores - only use high scores (>0.8) when the mapping is very clear
[14] Always copy the complete, verbatim text from the rebuttal for each weakness point, even if the same rebuttal section addresses multiple weaknesses
    Preserve the exact wording from the rebuttal when extracting responses CRITICAL RULE - NO SHORTCUTS OR REFERENCES: You must NEVER use summarizing phrases or references like "[Same content as W2 response]", "[Similar to above]", "[As mentioned earlier]", etc. Always copy the complete, verbatim text from the rebuttal for each weakness point, even if the sam...
[15] **Experiments**: The reviewer is questioning experimental **setup and design**. This includes missing or insufficient experiments, lack of ablation studies, weak baseline comparisons, unclear descriptions of datasets, or issues with hyperparameter selection
[16] **Writing**: The reviewer is concerned about writing quality - grammar, clarity, readability, ambiguous phrasing, typos, missing definitions of symbols/terms, unclear explanations of concepts
[17] **Presentation**: The reviewer is critiquing presentation and organization - figures, tables, and organization issues, unclear plots, missing legends, poor formatting, misplaced content, overall paper structure making it hard to follow
[18] **Theory**: The reviewer is questioning theoretical aspects - incorrect mathematical derivations, flawed assumptions, weak theoretical justification, missing proofs, inconsistency between claims and formulas
[19] **Novelty**: The reviewer is questioning novelty and originality - lack of novelty or originality, overlap with prior work, incremental contribution, insufficient differentiation from existing methods
[20] **Reproducibility**: The reviewer is concerned about reproducibility - missing implementation details, absent code or pseudo-code, hyperparameters not specified, insufficient information to reproduce results
[21] **Evaluation**: The reviewer is concerned about how the experimental results are **measured, analyzed, and interpreted**. This includes the use of inappropriate or missing evaluation metrics, insufficient analysis of results, or inconsistencies between reported results and the paper's claims
[22] Actionability
    **Miscellaneous**: Content that is not a direct review point (weaknesses, questions, suggestions) about the paper. This includes polite remarks, Summative or transitional comments, summaries of the paper's or review's content, or irrelevant text. Please analyze the following review point and identify from which perspective the reviewer is raising their co...
[23] Very poor: No concrete next step. Vague remarks like "improve experiments."
[24] Poor: A possible step is implied but not described. No criteria for success
[25] Fair: At least one concrete suggestion, but incomplete or underspecified
[26] Good: Clear, feasible steps with some parameters or success criteria
[27] Excellent: A short plan with steps, locations in the paper, parameters or tests, and what outcome would address the issue. ### Specificity (1-5)
[28] Very poor: Generic template text that could apply to any paper
[29] Poor: Mentions broad areas but no details
[30] Fair: Refers to a section, figure, dataset, or claim but stays broad
[31] Good: Points to exact sections, figures, metrics, or settings
[32] Excellent: Pinpoints precise passages or numbers and names exact variables, metrics, or ablation locations. ### Groundedness (1-5)
[33] Very poor: Speculative, incorrect, or contradicted by the paper
[34] Poor: Weak link to the paper; no verifiable reference
[35] Fair: Partly grounded with at least one reference to paper content
[36] Good: Well supported with references to specific content
[37] Excellent: Strongly supported with exact identifiers or numbers from the paper (for example "Table 2 shows 71.3 vs 71.1 and the claim of a large gain is not supported"). ### Relevance (1-5)
[38] Very poor: Off topic relative to the target perspective or the main paper issues
[39] Poor: Mostly off topic with minor relevant content
[40] Fair: Partially aligned. Mixes relevant and irrelevant feedback
[41] Good: Mostly aligned with the target perspective
[42] Excellent: Fully aligned with the target perspective and the paper's main contributions. ### Helpfulness (1-5)
[43] Very poor: Unclear, hostile, or not useful
[44] Poor: Slightly useful but confusing or impractical
[45] Fair: Some useful content, needs refinement to be actionable
[46] Good: Clear, constructive, and practically useful
[47] Excellent: Directly helps the authors improve the paper with minimal ambiguity. **Paper Content:** paper content **Review Perspective:** perspective **Review Comment to Evaluate:** review text Please provide scores (1-5) for each dimension along with your reasoning. Be critical and precise in your evaluation. You MUST respond with a valid JSON object in t...
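
The point-wise judge prompt excerpted above asks for 1-5 scores on five dimensions, returned as a JSON object. A small validator for such a reply might look like the sketch below. Because the prompt's JSON schema is cut off in the excerpt, the lowercase key names and the flat object layout are assumptions, not the paper's format.

```python
import json

# The five rubric dimensions named in the excerpted prompt; key spelling is assumed.
DIMENSIONS = ("actionability", "specificity", "groundedness", "relevance", "helpfulness")


def parse_judge_scores(raw: str) -> dict:
    """Parse a judge reply and check that every dimension is an integer from 1 to 5."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    return scores
```

Failing loudly on missing or out-of-range scores keeps malformed judge replies from silently skewing averaged results.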
