
arxiv: 2604.17141 · v2 · submitted 2026-04-18 · 💻 cs.CL

SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction

Pith reviewed 2026-05-10 06:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords scientific impact prediction · LLM benchmarking · multi-task fine-tuning · contrastive learning · research evaluation · citation prediction · multi-field benchmark

The pith

Multi-task fine-tuning on contrastive paper pairs lets 4B LLMs outperform 30B models on predicting scientific impact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciImpact, a benchmark of 215,928 contrastive paper pairs across 19 fields that capture differences in scientific impact through citations, awards, media attention, patents, and artifacts. It evaluates 11 LLMs and finds that off-the-shelf models show high variability, but multi-task supervised fine-tuning allows smaller models to exceed the performance of much larger ones and closed-source systems. This matters because it offers a more comprehensive way to assess research influence beyond citations and shows that targeted training can make efficient models competitive for complex prediction tasks.

Core claim

The SciImpact benchmark, built from heterogeneous data sources and web crawling to form 215,928 contrastive paper pairs reflecting short-term and long-term impact differences, reveals that multi-task supervised fine-tuning consistently enables smaller LLMs (such as 4B parameter models) to markedly outperform much larger models (such as 30B) and surpass powerful closed-source LLMs like o4-mini on the task of scientific impact prediction.

What carries the argument

The SciImpact benchmark of contrastive paper pairs used for multi-task supervised fine-tuning of LLMs to predict multi-dimensional scientific impact.

If this is right

  • Smaller, fine-tuned LLMs become viable alternatives to larger models for evaluating research impact.
  • Performance of LLMs varies substantially across different impact dimensions and scientific fields.
  • The benchmark provides a challenging test that highlights limitations of off-the-shelf LLMs.
  • Both short-term indicators like best paper awards and long-term ones like Nobel Prizes can be modeled through these pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that curated contrastive data can matter more than model scale for certain scientific reasoning tasks.
  • Similar benchmarks could be developed for other domains like technology or policy impact prediction.
  • Deploying these fine-tuned models might enable more scalable analysis of scientific literature for funding decisions or trend detection.

Load-bearing premise

The constructed contrastive paper pairs accurately capture unbiased and meaningful differences in scientific impact across the various dimensions and fields.

What would settle it

Demonstrating that the fine-tuned 4B models do not outperform the 30B models or closed-source LLMs on a new set of paper pairs would falsify the main result.
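As a rough illustration of how such a check might be scored, the sketch below computes pairwise accuracy overall and per impact dimension for one model's predictions on a set of paper pairs. The record fields (`dimension`, `label`, `prediction`) and the function names are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def pairwise_accuracy(records):
    """Return (overall accuracy, per-dimension accuracy) for pairwise predictions.

    Each record is a dict with hypothetical keys:
      'dimension'  : impact dimension name, e.g. 'citation'
      'label'      : 'A' or 'B', the paper with the higher impact
      'prediction' : 'A' or 'B', the paper the model chose
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["dimension"]] += 1
        if record["prediction"] == record["label"]:
            correct[record["dimension"]] += 1
    per_dimension = {dim: correct[dim] / total[dim] for dim in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return overall, per_dimension


def accuracy_gap(records_model_x, records_model_y):
    """Overall accuracy of model X minus overall accuracy of model Y."""
    acc_x, _ = pairwise_accuracy(records_model_x)
    acc_y, _ = pairwise_accuracy(records_model_y)
    return acc_x - acc_y
```

A negative or near-zero `accuracy_gap` for the fine-tuned 4B model against a 30B model on fresh pairs would be the kind of evidence described above, although a raw gap alone says nothing about statistical significance.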

Figures

Figures reproduced from arXiv: 2604.17141 by Hangxiao Zhu, Ping Nie, Yuyu Zhang, Yu Zhang.

Figure 1
Figure 1: Performance of o4-mini, off-the-shelf Qwen3-4B, and supervised fine-tuned Qwen3-4B across the seven impact dimensions on SciImpact. Supervised fine-tuning (SFT) substantially enhances a 4B open-weight model’s ability to predict scientific impact across all dimensions, enabling it to rival or surpass a stronger closed-source model.
Figure 2
Figure 2: Overview of the SciImpact benchmark curation pipeline, including candidate retrieval, impact labeling and pair generation, and filtering and quality control.

Pair construction rules (table excerpt, truncated in the source):

  Dimension   Pair Construction Rule
  Citation    y(A+) ≥ 10, y(A−) ≥ 10, y(A+)/y(A−) ≥ 2
  Award       y(A+) = True, y(A−) = False
  Patent      y(A+) ≥ 5, y(A−) ≥ 5, y(A+)/y(A−) ≥ 2
  Media       y(A+) ≥ 5, y(A−) ≥ 5, y(A+)/y(A−) ≥ 2
  Code        y(A+) ≥ 10, y(A−) ≥ 10, y(A+)/y…
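The pair construction rules in the Figure 2 excerpt can be read as a simple validity check on a candidate pair. The sketch below is one possible reading, assuming that the two adjacent `y(A+) y(A−)` terms are a flattened ratio y(A+)/y(A−). The function and constant names are hypothetical, and the Code dimension is left out because its rule is truncated in the excerpt.

```python
# Thresholds transcribed from the Figure 2 excerpt: (minimum count, minimum ratio).
# Assumption: the ratio reading of "y(A+) y(A-) >= 2" is correct.
# The 'code' dimension is omitted because its rule is cut off in the excerpt.
COUNT_RULES = {
    "citation": (10, 2.0),
    "patent": (5, 2.0),
    "media": (5, 2.0),
}


def is_valid_pair(dimension, y_pos, y_neg):
    """Check whether (A+, A-) satisfies the excerpted rule for a dimension.

    y_pos and y_neg are the impact values y(A+) and y(A-): booleans for
    'award', counts for the count-based dimensions.
    """
    if dimension == "award":
        # The positive paper has the award, the negative paper does not.
        return y_pos is True and y_neg is False
    if dimension not in COUNT_RULES:
        raise ValueError(f"no rule available for dimension: {dimension!r}")
    min_count, min_ratio = COUNT_RULES[dimension]
    # Both papers must clear the minimum count, and the positive paper must
    # have at least min_ratio times the count of the negative paper.
    # y_neg >= min_count > 0 guards the division against zero.
    return (
        y_pos >= min_count
        and y_neg >= min_count
        and y_pos / y_neg >= min_ratio
    )
```

For example, under this reading a citation pair with counts 40 and 15 qualifies, while a pair with counts 25 and 15 does not, because the ratio falls below 2.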
Figure 3
Figure 3: Text length distribution by dimension. Each …
Figure 4
Figure 4: Citation accuracy by publication year. The bar …
Figure 5
Figure 5: Performance of o4-mini, off-the-shelf Qwen3-4B, and supervised fine-tuned Qwen3-4B across scientific fields on SciImpact. SFT substantially enhances a 4B open-weight model’s ability to predict scientific impact across all fields, enabling it to rival or surpass stronger closed-source models.

B Prompts, B.1 Citation. System: You are an impartial judge deciding which of two research papers has more citations…
Original abstract

The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models' capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short-term (e.g., Best Paper Award) and long-term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models exhibit substantial variability across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction. Our project homepage is https://flypig23.github.io/sciimpact-homepage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SciImpact, a large-scale multi-dimensional benchmark for scientific impact prediction spanning 19 fields. It comprises 215,928 contrastive paper pairs constructed by integrating heterogeneous data sources and targeted web crawling to capture differences across dimensions including citation counts, award recognition, media attention, patent references, and artifact adoption, in both short-term and long-term settings. The authors evaluate 11 LLMs, reporting substantial variability in off-the-shelf performance across dimensions and fields, while demonstrating that multi-task supervised fine-tuning enables smaller models (e.g., 4B) to outperform much larger models (e.g., 30B) and closed-source LLMs (e.g., o4-mini).

Significance. If the contrastive pairs reliably encode unbiased impact differences, the benchmark would be a valuable contribution by extending evaluation beyond citation metrics to multiple dimensions and fields, offering a challenging testbed for LLM reasoning about scientific influence. The reported fine-tuning result—that smaller models can surpass larger ones after multi-task SFT—would be noteworthy for highlighting efficient adaptation strategies, with the project homepage supporting reproducibility.

major comments (2)
  1. [Abstract] The central claim that the 215,928 contrastive pairs 'reflect meaningful impact differences' rests on the construction process ('integrating heterogeneous data sources and targeted web crawling'), but the abstract provides no details on pair construction criteria, validation of labels, bias mitigation (e.g., field-specific coverage or English-language media bias), or statistical tests confirming the differences. This is load-bearing for all LLM evaluation and fine-tuning results.
  2. [Evaluation section] The reported performance gains from multi-task SFT (smaller 4B models outperforming 30B and o4-mini) lack accompanying statistical significance tests, confidence intervals, or an ablation on label noise, all of which are needed to substantiate the cross-model comparisons given the heterogeneous label sources.
minor comments (1)
  1. [Abstract] The abstract refers to 'o4-mini'; clarify the exact model name and version for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make to improve the paper.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the 215,928 contrastive pairs 'reflect meaningful impact differences' rests on the construction process ('integrating heterogeneous data sources and targeted web crawling'), but the abstract provides no details on pair construction criteria, validation of labels, bias mitigation (e.g., field-specific coverage or English-language media bias), or statistical tests confirming the differences. This is load-bearing for all LLM evaluation and fine-tuning results.

    Authors: We agree that the abstract, constrained by length, does not provide these details. The full paper (Section 3) describes the pair construction criteria in depth, including how pairs are formed based on significant differences in impact metrics from integrated sources, label validation through a combination of automated checks and human annotation on a sample, bias mitigation via stratified sampling across 19 fields and inclusion of non-English sources where possible, and statistical tests (e.g., t-tests on metric differences) to confirm meaningful distinctions. To make this more accessible, we will revise the abstract to concisely mention the validation process, bias mitigation steps, and that statistical tests support the impact differences. This revision will be made in the next version. revision: yes

  2. Referee: [Evaluation section] The reported performance gains from multi-task SFT (smaller 4B models outperforming 30B and o4-mini) lack accompanying statistical significance tests, confidence intervals, or an ablation on label noise, all of which are needed to substantiate the cross-model comparisons given the heterogeneous label sources.

    Authors: We acknowledge that the evaluation section would benefit from greater statistical rigor. Our reported results demonstrate consistent trends, but we did not include formal tests in the initial submission. In the revised manuscript, we will add statistical significance tests (such as McNemar's test or bootstrap resampling for accuracy differences) with p-values, and report 95% confidence intervals for all model performances. Furthermore, we will conduct and report an ablation study on label noise by introducing controlled noise levels to the training labels and observing the impact on fine-tuned model performance, as well as comparing results on subsets with higher-confidence labels. These additions will strengthen the substantiation of our claims regarding the effectiveness of multi-task SFT. revision: yes
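The tests named in the second response (McNemar's test and bootstrap resampling for accuracy differences) could be sketched as follows. This is a minimal standard-library illustration, assuming both models are scored on the same pairs and that each list holds 1 for a correct prediction and 0 for an incorrect one. The function names and defaults are hypothetical and do not describe the authors' planned implementation.

```python
import math
import random


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags (1 or 0).

    Only discordant pairs matter: those where exactly one model is correct.
    Under the null hypothesis they split 50/50 between the two models.
    """
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability of k or fewer, doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    Pairs are resampled with replacement, keeping each pair's two
    correctness flags together so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With the default `alpha=0.05` the interval is a 95% interval, matching the confidence level mentioned in the response. An interval that excludes zero and a small McNemar p-value would both point to a real accuracy difference between the two models.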

Circularity Check

0 steps flagged

No circularity in benchmark construction or empirical evaluation

Full rationale

The paper constructs SciImpact by integrating heterogeneous external data sources and targeted web crawling to produce 215,928 contrastive paper pairs across 19 fields and multiple impact dimensions. This construction step is independent of the subsequent LLM evaluations and multi-task SFT experiments. No equations, parameters, or derivations are present that reduce by construction to fitted inputs or self-referential definitions. The reported performance gains (smaller models outperforming larger ones after SFT) are empirical outcomes measured on the externally sourced benchmark rather than tautological results. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; main unstated premise is that integrated data sources and contrastive pairing produce valid impact labels. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: Contrastive paper pairs constructed from heterogeneous sources and web crawling accurately capture meaningful impact differences.
    This underpins the 215,928 pairs used for evaluation and fine-tuning.

pith-pipeline@v0.9.0 · 5536 in / 1239 out tokens · 46815 ms · 2026-05-10T06:10:04.424283+00:00 · methodology

