GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Chani Jung; Hyunwoo Kim; Jimin Mun; Maarten Sap; Xuhui Zhou

arxiv: 2604.11924 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3

keywords: constructive feedback, scientific paper review, author responses, LLM training, preference optimization, feedback validity, actionable feedback, research assistance

The pith

A training recipe uses author responses to reviewer comments to teach language models how to generate more valid and actionable feedback on scientific papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of papers with reviewer comments labeled as valid or actionable according to whether authors addressed them in their responses. It then applies a two-part training process of fine-tuning on the effective examples and preference optimization that ranks better feedback higher than weaker alternatives, including synthetic pairs. This produces feedback that aligns more closely with what actually helps authors improve their work. A sympathetic reader would care because the method shows a concrete path for language models to assist in the research process without removing human judgment. The work focuses on feedback that targets both the research content and its presentation.

Core claim

The central claim is that annotating reviewer feedback using signals from author responses, followed by fine-tuning on valid and actionable cases plus preference optimization over real and synthetic pairs, trains models to output constructive feedback that authors are more likely to find effective and to act on.
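As a rough illustration of the preference-optimization half of the recipe, the sketch below computes the standard DPO loss for a single preference pair (Figure 2 names DPO as the method). It assumes sequence log-probabilities are already available from the policy and from a frozen reference model; the function name, argument names, and the default `beta` are illustrative choices, not values taken from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    All inputs are summed token log-probabilities of a full response.
    beta is a hypothetical default, not a setting reported by the paper.
    """
    # How much more the policy favors the chosen response over the rejected
    # one, measured relative to the frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), computed as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy shifts probability toward the chosen feedback.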

What carries the argument

The GoodPoint training recipe that extracts success signals from author responses to identify valid and actionable feedback, then combines supervised fine-tuning with preference optimization on real and synthetic pairs.
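One way to picture the success-signal extraction is a filter over annotated feedback items. The sketch below is a minimal illustration, not the authors' pipeline: the `FeedbackItem` class, the `sft_examples` helper, and the rule that "successful" means the authors agreed and committed to revise are all assumptions. The label strings follow the annotation prompt excerpt extracted from the paper.

```python
from dataclasses import dataclass

@dataclass
class FeedbackItem:
    text: str
    validity: str       # "agreed by authors" | "rebutted by authors" | "unclear"
    author_action: str  # e.g. "will revise", "defer future work", "unclear or no response"

def is_successful(item):
    # Assumption: successful = authors agreed AND committed to a revision.
    return item.validity == "agreed by authors" and item.author_action == "will revise"

def sft_examples(paper_text, items):
    """Build one fine-tuning example that keeps only successful feedback."""
    targets = [it.text for it in items if is_successful(it)]
    if not targets:
        return None  # nothing to learn from this paper
    return {"prompt": paper_text, "completion": "\n".join(targets)}
```

Items that fail the filter are not discarded in the recipe as described; they can serve as the weaker side of preference pairs.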

If this is right

Models trained this way generate feedback that aligns better with cases where authors made changes.
The resulting models reach higher performance in matching human judgments of feedback quality than other models of similar size.
Human authors rate the feedback as having greater practical value for improving their papers.
Mixing real author responses with synthetic preference pairs strengthens the model's ability to select high-quality feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems using this approach could supply preliminary feedback to authors while they are still drafting, before formal submission.
Collecting comparable response data from other research communities would show whether the method transfers beyond the original paper collection.
The technique suggests that interaction histories between authors and reviewers can serve as a general source for training collaborative AI tools.

Load-bearing premise

Author responses to reviewer comments accurately reflect the validity and actionability of the feedback without introducing systematic bias.

What would settle it

A side-by-side comparison in which independent experts label the same set of reviewer comments for validity and actionability, then models trained on the author-response labels are tested against the expert labels; substantial mismatch in outcomes would indicate the signals are unreliable.
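A first step toward such a comparison would be a chance-corrected agreement score between the author-response labels and the expert labels. The sketch below computes Cohen's kappa in plain Python; the function name and the example label strings are illustrative, and the paper does not report this statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each source's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for four reviewer comments.
proxy = ["valid", "valid", "invalid", "invalid"]
expert = ["valid", "invalid", "invalid", "invalid"]
kappa = cohens_kappa(proxy, expert)
```

A kappa near zero would suggest the proxy labels carry little information beyond chance, which is the failure case this section describes.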

Figures

Figures reproduced from arXiv: 2604.11924 by Chani Jung, Hyunwoo Kim, Jimin Mun, Maarten Sap, Xuhui Zhou.

**Figure 2.** Overview of GOODPOINT training strategy. We first fine-tune Qwen3-8B on successful (i.e., valid and actionable) feedback from ICLR 2020–2025. We then apply DPO using two types of preference pairs: (1) those contrasting varying proportions of unsuccessful items, and (2) those featuring successful items with synthetic corruptions targeting specificity, clarity, accuracy, prioritization, and supportive tone…

**Figure 3.** Mean scores for paper-specific grounding and prioritization across models with …

**Figure 4.** Author action and validity distributions from human evaluation. The top row …

**Figure 5.** show distribution of quality scores across models and humans. We see that LLMs specifically have noticeably lower scores in paper-specific grounding and prioritization. Furthermore, accuracy is also relatively lower for LLM-generated feedback. Therefore, to address these LLM-specific failure modes not observed in human data, we utilize LLM-as-a-judge quality scoring to filter based on human average scores …

**Figure 6.** Human reviewer feedback author response distribution from entire …

**Figure 7.** Illustration of human consensus-based feedback evaluation with example feedback …
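The first pair type in the Figure 2 caption, pairs that contrast varying proportions of unsuccessful items, could be built along the lines below. This is a guess at the mechanics rather than the authors' procedure: the function names, the fractions, and the list size are hypothetical, and the synthetic-corruption pairs are not covered because they require a generator model.

```python
import random

def mix(successful, unsuccessful, bad_fraction, size, rng):
    """Sample a feedback list of the given size with a set share of unsuccessful items."""
    n_bad = round(bad_fraction * size)
    items = rng.sample(successful, size - n_bad) + rng.sample(unsuccessful, n_bad)
    rng.shuffle(items)
    return items

def proportion_pair(prompt, successful, unsuccessful, low=0.0, high=0.5, size=4, seed=0):
    """Preference pair: the list with fewer unsuccessful items is the chosen one."""
    rng = random.Random(seed)
    return {
        "prompt": prompt,
        "chosen": mix(successful, unsuccessful, low, size, rng),
        "rejected": mix(successful, unsuccessful, high, size, rng),
    }
```

Each pool must hold enough items for the requested size and fraction; otherwise `random.sample` raises `ValueError`.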

Original abstract

While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.
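The abstract does not say whether the 83.7% figure is a relative or an absolute change, and it gives no baseline value. If it is read as a relative improvement (an assumption), then with s_base and s_GoodPoint denoting the predicted success rates of the base and trained models (symbols introduced here, not in the paper), the claim would amount to:

```latex
\[
\frac{s_{\text{GoodPoint}} - s_{\text{base}}}{s_{\text{base}}} = 0.837
\quad\Longleftrightarrow\quad
s_{\text{GoodPoint}} = 1.837\, s_{\text{base}}
\]
```

Under this reading the absolute gain depends entirely on the unreported baseline, which is why the missing definition matters.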

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GoodPoint mines author responses for a 19k-paper dataset and trains models with clear benchmark gains, though the proxy labels carry real noise risk.

The main thing to know is that this paper turns author replies to reviews into labels for valid and actionable feedback at scale. The authors build GoodPoint-ICLR from 19K ICLR papers this way, then fine-tune on the positive cases and add preference optimization over real and synthetic pairs. A Qwen3-8B model trained this way lifts predicted success rate by 83.7% over the base model on their 1.2K-paper test set, matches human feedback better than other similar-sized models, and even beats Gemini-3-flash on precision. An expert human study backs up that authors rate the outputs as more useful in practice.

Referee Report

1 major / 2 minor

Summary. The paper claims to advance LLM-based constructive feedback generation for scientific papers. It curates GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer comments annotated for validity and actionability using author responses. GoodPoint trains models via fine-tuning on positive signals and preference optimization on real and synthetic pairs. On a 1.2K-paper ICLR benchmark, a GoodPoint-tuned Qwen3-8B improves the predicted success rate by 83.7% over the base model and achieves state-of-the-art feedback matching among LLMs of similar size (surpassing Gemini-3-flash in precision); an expert human study confirms its practical value.

Significance. If the central results hold, the work is significant for developing practical LLM tools to assist researchers in improving their papers through targeted feedback. Notable strengths are the scale of the annotated dataset, the hybrid training approach combining supervised fine-tuning with preference optimization, the independent golden human feedback benchmark, and the expert human study providing author-perceived validation. These contribute to reproducible and empirically grounded progress in AI-augmented scientific writing.

major comments (1)

[§3] §3: The validity and actionability labels for the GoodPoint-ICLR dataset are derived directly from author responses without any described independent expert validation or agreement metrics on the proxy. Author responses may contain biases such as politeness or selective engagement, which could introduce systematic errors into the training data. This is load-bearing for the 83.7% improvement claim on the 1.2K benchmark, as the model may optimize for response patterns rather than true feedback quality. Although the human study offers supporting evidence, explicit validation of the annotation proxy would be required to fully substantiate the pipeline.

minor comments (2)

[Abstract] The abstract states an 83.7% improvement but does not define 'predicted success rate' or provide baseline values; including a brief definition or reference to the evaluation section would enhance readability.
[Evaluation] Clarify the relationship between the 19K GoodPoint-ICLR dataset and the 1.2K benchmark to confirm no data leakage in the reported SOTA results.
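The leakage concern in the second minor comment can be checked mechanically once both paper lists are available. The sketch below is a minimal example that assumes each paper carries a string identifier (for instance a submission ID); the function name and the normalization rule are hypothetical.

```python
def leaked_ids(train_ids, benchmark_ids):
    """Return benchmark paper IDs that also appear in the training set."""
    def normalize(paper_id):
        # Assumption: IDs differ at most by case and surrounding whitespace.
        return paper_id.strip().lower()

    train = {normalize(i) for i in train_ids}
    bench = {normalize(i) for i in benchmark_ids}
    return sorted(train & bench)
```

An empty result rules out exact ID overlap only; near-duplicate papers, such as resubmissions across years, would need a text-similarity check.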

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The major comment raises an important point about our annotation methodology, which we address below with additional context and proposed revisions to the manuscript.

Referee: The validity and actionability labels for the GoodPoint-ICLR dataset are derived directly from author responses without any described independent expert validation or agreement metrics on the proxy. Author responses may contain biases such as politeness or selective engagement, which could introduce systematic errors into the training data. This is load-bearing for the 83.7% improvement claim on the 1.2K benchmark, as the model may optimize for response patterns rather than true feedback quality. Although the human study offers supporting evidence, explicit validation of the annotation proxy would be required to fully substantiate the pipeline.

Authors: We appreciate the referee highlighting this aspect of our pipeline. The choice to derive labels from author responses is deliberate: authors are best positioned to assess whether reviewer feedback is valid (by confirming the identified issue) and actionable (by describing concrete revisions they will make). Our annotation extracts explicit signals such as acknowledgments of problems and stated revision plans, rather than relying on vague or polite language. This provides a direct, outcome-oriented signal unavailable from third-party annotators. We acknowledge potential biases like selective engagement or politeness; however, by prioritizing verifiable commitments over general thanks, we reduce their impact. The expert human study in §5, where authors rate GoodPoint feedback higher in practical value, offers supporting evidence that the proxy aligns with real utility. In the revised manuscript, we will expand §3 with a new subsection on annotation heuristics, explicit discussion of biases and mitigations, and inter-annotator consistency checks on a sampled subset of responses. We will also clarify that the 83.7% improvement is measured against the same proxy on held-out data, with the human study providing orthogonal validation. While a full independent expert re-labeling of the 19K training set was not performed due to scale, the proposed additions will better substantiate the approach.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline grounded by independent human validation

The paper curates a 19K-paper dataset by annotating reviewer feedback using author responses as proxies for validity and actionability, then applies fine-tuning and preference optimization to train a feedback generator. Evaluation reports an 83.7% lift in predicted success rate on a 1.2K-paper benchmark, SOTA feedback matching on a separate golden human feedback set, and confirmation via an expert human study. No derivation step reduces by construction to the inputs: the training signals and success metrics are not definitionally identical, no equations equate fitted parameters to predictions, and no self-citation chain or uniqueness theorem is invoked to force the result. The presence of held-out benchmarks and external human grounding keeps the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on domain assumptions about the informativeness of author responses and standard ML training assumptions.

free parameters (1)

Hyperparameters for fine-tuning and preference optimization
Standard LLM training parameters not detailed in abstract but required for the recipe.

axioms (1)

Domain assumption: Author responses provide reliable signals for the validity and actionability of reviewer feedback.
This is used to annotate the dataset and define success.

pith-pipeline@v0.9.0 · 5528 in / 1395 out tokens · 71944 ms · 2026-05-10T16:20:31.608194+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1] feedback text: Rewrite the reviewer feedback as a clear, standalone unit, preserving the original text as closely as possible.

[2] author response text: Rewrite the author response as a clear, standalone unit, preserving tone and first-person pronouns

[3] validity: Did the authors agree the feedback is valid? agreed by authors | rebutted by authors | unclear

[4] author action: What did the authors commit to do? will revise | defer future work | point to existing content | no revision accept | no revision contest | no action other | unclear or no response

[5] dimensions: Break down the feedback into three dimensions if present: Feed Up (goal), Feed Back (gap), Feed Forward (next steps)

[6] aspects: Identify emphasized aspects (e.g., Clarity and Presentation, Reproducibility, Novelty, Theoretical Soundness, etc.) Return output strictly as JSON following the prescribed schema. Full decision rules, label definitions, and output schema are omitted here for brevity. Input: {conversation text} C Feedback Corruption & Verification Feedback corrup...
