From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation

Chenyang Gu; Guoxiu He; Jiacheng Yao; Jiawei Liu; Jinquan Zheng; Pujun Zheng; Tianrui Guo; Wei Lu; Yong Huang

arxiv: 2603.17588 · v2 · pith:QM6VRVGTnew · submitted 2026-03-18 · 💻 cs.IR · cs.CL

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

keywords LLM-based evaluation · paper ranking · pairwise comparisons · collaborative ranking · scientific paper assessment · graph-based sampling · relative quality judgment

The pith

LLM paper evaluators that rank papers through pairwise comparisons, rather than assigning absolute scores, achieve an average 21.8% relative improvement over the DeepReview-14B baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often evaluate papers by assigning absolute scores independently. This approach falters because score scales change across different conferences and criteria, causing models to learn narrow rules instead of general judgment. The paper proposes a comparison-native framework that trains models to compare pairs of papers directly and then combine those results into overall rankings. This leads to better performance and generalization, as shown by improvements on multiple datasets.

Core claim

By shifting from isolated absolute scoring to collaborative pairwise ranking, the CNPE framework lets LLMs develop more robust scholarly judgment. The framework samples informative paper pairs using a graph-based similarity ranking algorithm, fine-tunes the model with supervised learning and reinforcement learning using comparison-based rewards, and at inference aggregates pairwise preference signals into a global relative quality ranking. Experiments show an average 21.8% relative improvement over DeepReview-14B and good generalization to five new datasets.

What carries the argument

CNPE framework, which integrates comparison into data construction via graph-based similarity ranking for sampling pairs and model learning through supervised fine-tuning and reinforcement learning with comparison rewards, then aggregates pairwise comparisons at inference into a global ranking.

If this is right

Models trained this way should generalize better to new conferences and evaluation criteria because they learn relative judgments instead of absolute score patterns.
Evaluation becomes more stable since relative comparisons are less affected by shifts in score scales across time or venues.
The graph-based sampling method allows efficient selection of discriminative pairs without evaluating all possible combinations.
Reinforcement learning with comparison-based rewards can further refine the model's ability to distinguish paper quality.
The aggregation step turns local preferences into a consistent global ranking of papers.
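
A passage quoted further down this page attributes the aggregation to the Bradley-Terry model, but the fitting procedure itself is not described here. The following is therefore only a minimal sketch under stated assumptions: it fits Bradley-Terry strengths with the standard minorization-maximization update, and it adds a hypothetical smoothing term (`prior`, a fractional win and loss against a virtual paper of strength 1) so that papers with no wins or no losses keep finite strengths. The function name and the toy comparisons are illustrative, not taken from the paper.

```python
from collections import defaultdict


def bradley_terry_rank(comparisons, n_iter=500, tol=1e-10, prior=0.5):
    """Rank papers from (winner, loser) pairs with a Bradley-Terry fit.

    `prior` is an assumed smoothing constant, not something the paper specifies.
    """
    items = sorted({paper for pair in comparisons for paper in pair})
    wins = {i: prior for i in items}      # smoothed win counts
    n = defaultdict(float)                # comparisons per ordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        n[(winner, loser)] += 1.0
        n[(loser, winner)] += 1.0

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            # Virtual opponent of strength 1: `prior` wins plus `prior` losses.
            denom = 2 * prior / (strength[i] + 1.0)
            for j in items:
                if j != i and n[(i, j)] > 0:
                    denom += n[(i, j)] / (strength[i] + strength[j])
            new[i] = wins[i] / denom
        delta = max(abs(new[i] - strength[i]) for i in items)
        strength = new
        if delta < tol:
            break

    ranking = sorted(items, key=lambda i: strength[i], reverse=True)
    return ranking, strength


# Toy data: A beats B twice, A beats C once, B beats C once.
ranking, strength = bradley_terry_rank(
    [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
)
print(ranking)
```

Because the fit uses every comparison at once, a single inconsistent judgment shifts the strengths slightly rather than breaking the ordering outright, which is one reason such models are a common choice for turning local preferences into a global ranking.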

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such comparison-based evaluators could be applied to other ranking tasks like grant proposals or peer review in different fields.
If the aggregation works reliably, it might reduce the need for human calibration of scores across different evaluation rounds.
Future work could test whether this approach scales to very large paper collections by optimizing the graph sampling.

Load-bearing premise

That turning many separate paper comparisons into one consistent overall ranking can be done reliably without losing key quality information.

What would settle it

Test the framework on a collection of papers with a known expert ranking and see if the model's aggregated ranking matches the experts more closely than models using absolute scores.
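
One way to run such a check is to score agreement between the aggregated ranking and the expert ranking with a rank correlation. The sketch below computes Kendall's tau for two strict rankings of the same papers; the choice of metric, the function name, and the paper identifiers are assumptions for illustration, since this page does not say which agreement measure the paper uses.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two strict rankings (best first) of the same items."""
    pos_a = {paper: rank for rank, paper in enumerate(ranking_a)}
    pos_b = {paper: rank for rank, paper in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b) or len(pos_a) < 2:
        raise ValueError("rankings must cover the same set of at least two items")

    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


expert = ["p1", "p2", "p3", "p4"]   # hypothetical expert ranking
model = ["p1", "p3", "p2", "p4"]    # hypothetical aggregated model ranking
tau = kendall_tau(expert, model)
print(round(tau, 4))  # one swapped pair out of six: (5 - 1) / 6
```

Running the same function on rankings derived from an absolute-score baseline would give the side-by-side comparison the test calls for.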

Original abstract

Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently. However, since score scales vary across conferences, time periods, and evaluation criteria, models trained on absolute scores are prone to fitting narrow, context-specific rules rather than developing robust scholarly judgment. To overcome this limitation, we propose shifting paper evaluation from isolated scoring to collaborative ranking. In particular, we design a $\textbf{C}$omparison-$\textbf{N}$ative framework for $\textbf{P}$aper $\textbf{E}$valuation ($\textbf{CNPE}$), integrating comparison into both data construction and model learning. We first propose a graph-based similarity ranking algorithm to facilitate the sampling of more informative and discriminative paper pairs from a collection. We then enhance relative quality judgment through supervised fine-tuning and reinforcement learning with comparison-based rewards. At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking. Experimental results demonstrate that our framework achieves an average relative improvement of 21.8% over the strong baseline DeepReview-14B, while exhibiting robust generalization to five previously unseen datasets. Our code is available at https://github.com/ECNU-Text-Computing/ComparisonReview.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CNPE moves LLM paper evaluation to pairwise comparisons with graph-based sampling and RL rewards, reporting gains on new datasets but leaving aggregation stability under-specified.

Colleague, the core move here is replacing absolute scores with a full comparison pipeline: graph-based similarity ranking to pick informative pairs, supervised fine-tuning plus RL with comparison-based rewards, then pairwise judgments aggregated into a global ranking at inference. They claim a 21.8% average relative improvement over DeepReview-14B plus robust results on five unseen datasets. That generalization is the strongest piece of evidence they offer against narrow overfitting. The graph sampling plus two-stage comparison training is the distinct technical step beyond the absolute-scoring baselines they cite, and releasing the code helps. The motivation around score-scale drift across conferences is straightforward and reasonable.

The soft spot is the aggregation step. LLM pairwise judgments often contain cycles (A over B, B over C, C over A), and the abstract gives almost no detail on the exact method or any consistency checks. If that step is unstable, the reported gains could shrink or vary more than claimed. Without ablations on the sampling, full dataset stats, or significance numbers, it is hard to judge how much of the lift comes from the framework versus tuning choices.

This is for researchers building automated review tools in IR and NLP. A reader working on preference learning or ranking for LLMs would pick up usable ideas. It has enough of a concrete method and cross-dataset results to merit peer review rather than a desk reject; referees should focus on the aggregation procedure and experimental controls.

Referee Report

1 major / 1 minor

Summary. The paper proposes a Comparison-Native framework for Paper Evaluation (CNPE) that shifts LLM-based scientific paper assessment from independent absolute scoring to pairwise collaborative ranking. It introduces a graph-based similarity ranking algorithm to sample informative paper pairs, applies supervised fine-tuning and reinforcement learning with comparison-based rewards during training, and at inference performs pairwise comparisons whose signals are aggregated into a global relative quality ranking. Experiments report an average 21.8% relative improvement over the DeepReview-14B baseline together with robust generalization across five previously unseen datasets; code is released publicly.

Significance. If the empirical claims hold, the work offers a principled response to the well-known problem of score-scale variability and context-specific overfitting in LLM paper evaluation. The comparison-native data construction, training regime, and cross-dataset generalization results constitute a substantive advance over prior absolute-scoring approaches. Public code release is a clear strength that supports reproducibility and follow-on research in information retrieval and scholarly NLP.

major comments (1)

[Inference procedure] Inference procedure (abstract and §4): the central claim that pairwise preference signals can be aggregated into a stable global ranking without significant information loss is load-bearing for both the 21.8% improvement and the generalization results, yet the manuscript provides no description of the aggregation algorithm, no consistency metric for intransitive LLM judgments, and no ablation or robustness checks against cycles. This omission prevents verification that the reported gains are not artifacts of an unstable ranking step.

minor comments (1)

[Abstract and §4] The abstract and method sections would benefit from an explicit statement of the number of papers, pair-sampling statistics, and statistical significance tests supporting the 21.8% figure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the CNPE framework. We address the major comment on the inference procedure below.

Referee: [Inference procedure] Inference procedure (abstract and §4): the central claim that pairwise preference signals can be aggregated into a stable global ranking without significant information loss is load-bearing for both the 21.8% improvement and the generalization results, yet the manuscript provides no description of the aggregation algorithm, no consistency metric for intransitive LLM judgments, and no ablation or robustness checks against cycles. This omission prevents verification that the reported gains are not artifacts of an unstable ranking step.

Authors: We agree that the current manuscript describes the aggregation of pairwise preferences into a global ranking only at a high level in the abstract and Section 4, without sufficient algorithmic detail or supporting analyses. In the revised version we will expand Section 4 to provide a complete specification of the aggregation procedure, introduce a quantitative consistency metric for measuring intransitivity in the LLM-generated preferences, and add ablation experiments that evaluate ranking stability under varying levels of cyclic preferences. These changes will enable readers to verify that the reported improvements are not artifacts of an unstable aggregation step.

revision: yes
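
The rebuttal does not say which consistency metric will be used. As one plausible candidate, and purely as an assumption, the sketch below measures intransitivity as the fraction of fully judged paper triples whose three judgments form a cycle. It assumes at most one judgment per unordered pair; the function name and toy judgments are hypothetical.

```python
from itertools import combinations


def cyclic_triple_rate(judgments):
    """Fraction of fully judged triples whose pairwise outcomes form a cycle.

    `judgments` is an iterable of (winner, loser) pairs, with at most one
    judgment per unordered pair (an assumption of this sketch).
    """
    beats = set(judgments)
    papers = sorted({p for pair in beats for p in pair})
    judged = cyclic = 0
    for a, b, c in combinations(papers, 3):
        edges = [(a, b), (b, c), (a, c)]
        # Skip triples where some pair was never compared.
        if not all((x, y) in beats or (y, x) in beats for x, y in edges):
            continue
        judged += 1
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            cyclic += 1
    return cyclic / judged if judged else 0.0


# Toy data: A, B, C form a cycle; all three beat D.
toy = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")]
rate = cyclic_triple_rate(toy)
print(rate)  # 1 cyclic triple out of 4 judged triples
```

A rate of 0 means the judged triples are all transitive; higher values indicate that the aggregation step is being asked to reconcile more contradictory evidence.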

Circularity Check

0 steps flagged

No significant circularity; standard ranking and RL methods applied to new task with independent validation

The paper describes a CNPE framework that samples paper pairs via a graph-based similarity ranking algorithm, applies supervised fine-tuning and reinforcement learning with comparison-based rewards, and aggregates pairwise comparisons into a global ranking at inference. These steps use established techniques for preference aggregation and RL without reducing to self-definition or fitted inputs by construction. The 21.8% relative improvement and generalization claims are supported by experiments on five previously unseen datasets, supplying external grounding rather than tautological reduction to training quantities. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are present in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about the transitivity and aggregability of pairwise preferences plus the utility of graph-based sampling for informative comparisons; no new free parameters or invented entities are introduced beyond the framework design itself.

axioms (2)

domain assumption: Pairwise preference signals can be aggregated into a consistent global ranking.
Invoked at inference when multiple comparisons are combined into one overall ranking.

domain assumption: Comparison-based rewards in RL improve relative quality judgment over absolute scoring.
Core premise of the supervised fine-tuning and reinforcement learning stages.
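
To make the second assumption concrete: a passage quoted elsewhere on this page describes a reward that is issued only when the model produces the correct comparison outcome. A minimal sketch of such a binary reward follows. The output format (a single letter, "A" or "B", naming the preferred paper) and the function name are assumptions for illustration; the paper's actual reward function and output parsing are not shown here.

```python
def comparison_reward(model_output: str, gold: str) -> float:
    """Binary reward: 1.0 only for a well-formed, correct preference.

    Assumes (hypothetically) that the model answers with "A" or "B".
    """
    choice = model_output.strip().upper()
    if choice not in {"A", "B"}:
        return 0.0  # malformed answers earn nothing
    return 1.0 if choice == gold else 0.0


print(comparison_reward(" a ", "A"))   # correct after normalization
print(comparison_reward("B", "A"))     # wrong preference
print(comparison_reward("tie", "A"))   # malformed output
```

Unlike a regression loss on absolute scores, this signal depends only on which paper of the pair is preferred, so it is unaffected by how a given venue scales its scores.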

pith-pipeline@v0.9.0 · 5777 in / 1438 out tokens · 65244 ms · 2026-05-21T11:05:28.719002+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "At inference, the model performs pairwise comparisons over sampled paper pairs and aggregates these preference signals into a global relative quality ranking... using the Bradley-Terry model"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We use a comparison-based reward mechanism that issues rewards only when the model produces correct comparison outcomes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.