SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
Pith reviewed 2026-05-10 06:10 UTC · model grok-4.3
The pith
Multi-task fine-tuning on contrastive paper pairs lets 4B LLMs outperform 30B models on predicting scientific impact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The SciImpact benchmark, built from heterogeneous data sources and web crawling to form 215,928 contrastive paper pairs reflecting short-term and long-term impact differences, reveals that multi-task supervised fine-tuning consistently enables smaller LLMs (such as 4B parameter models) to markedly outperform much larger models (such as 30B) and surpass powerful closed-source LLMs like o4-mini on the task of scientific impact prediction.
What carries the argument
The SciImpact benchmark of contrastive paper pairs used for multi-task supervised fine-tuning of LLMs to predict multi-dimensional scientific impact.
If this is right
- Smaller, fine-tuned LLMs become viable alternatives to larger models for evaluating research impact.
- Performance of LLMs varies substantially across different impact dimensions and scientific fields.
- The benchmark provides a challenging test that highlights limitations of off-the-shelf LLMs.
- Both short-term indicators like best paper awards and long-term ones like Nobel Prizes can be modeled through these pairs.
Where Pith is reading between the lines
- This suggests that curated contrastive data can be more important than model scale for certain reasoning tasks in science.
- Similar benchmarks could be developed for other domains like technology or policy impact prediction.
- Deploying these fine-tuned models might enable more scalable analysis of scientific literature for funding decisions or trend detection.
Load-bearing premise
The constructed contrastive paper pairs accurately capture unbiased and meaningful differences in scientific impact across the various dimensions and fields.
What would settle it
Demonstrating that the fine-tuned 4B models do not outperform the 30B models or closed-source LLMs on a new set of paper pairs would falsify the main result.
Figures
read the original abstract
The rapid growth of scientific literature calls for automated methods to assess and predict research impact. Prior work has largely focused on citation-based metrics, leaving limited evaluation of models' capability to reason about other impact dimensions. To this end, we introduce SciImpact, a large-scale, multi-dimensional benchmark for scientific impact prediction spanning 19 fields. SciImpact captures various forms of scientific influence, ranging from citation counts to award recognition, media attention, patent reference, and artifact adoption, by integrating heterogeneous data sources and targeted web crawling. It comprises 215,928 contrastive paper pairs reflecting meaningful impact differences in both short-term (e.g., Best Paper Award) and long-term settings (e.g., Nobel Prize). We evaluate 11 widely used large language models (LLMs) on SciImpact. Results show that off-the-shelf models exhibit substantial variability across dimensions and fields, while multi-task supervised fine-tuning consistently enables smaller LLMs (e.g., 4B) to markedly outperform much larger models (e.g., 30B) and surpass powerful closed-source LLMs (e.g., o4-mini). These results establish SciImpact as a challenging benchmark and demonstrate its value for multi-dimensional, multi-field scientific impact prediction. Our project homepage is https://flypig23.github.io/sciimpact-homepage/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SciImpact, a large-scale multi-dimensional benchmark for scientific impact prediction spanning 19 fields. It comprises 215,928 contrastive paper pairs constructed by integrating heterogeneous data sources and targeted web crawling to capture differences across dimensions including citation counts, award recognition, media attention, patent references, and artifact adoption, in both short-term and long-term settings. The authors evaluate 11 LLMs, reporting substantial variability in off-the-shelf performance across dimensions and fields, while demonstrating that multi-task supervised fine-tuning enables smaller models (e.g., 4B) to outperform much larger models (e.g., 30B) and closed-source LLMs (e.g., o4-mini).
Significance. If the contrastive pairs reliably encode unbiased impact differences, the benchmark would be a valuable contribution by extending evaluation beyond citation metrics to multiple dimensions and fields, offering a challenging testbed for LLM reasoning about scientific influence. The reported fine-tuning result—that smaller models can surpass larger ones after multi-task SFT—would be noteworthy for highlighting efficient adaptation strategies, with the project homepage supporting reproducibility.
major comments (2)
- [Abstract] Abstract: The central claim that the 215,928 contrastive pairs 'reflect meaningful impact differences' rests on the construction process ('integrating heterogeneous data sources and targeted web crawling'), but the abstract provides no details on pair construction criteria, validation of labels, bias mitigation (e.g., field-specific coverage or English-language media bias), or statistical tests confirming the differences. This is load-bearing for all LLM evaluation and fine-tuning results.
- [Evaluation section] Evaluation section: The reported performance gains from multi-task SFT (smaller 4B models outperforming 30B and o4-mini) lack accompanying statistical significance tests, confidence intervals, or ablation on label noise, which is required to substantiate the cross-model comparisons given the heterogeneous label sources.
minor comments (1)
- [Abstract] The abstract refers to 'o4-mini'; clarify the exact model name and version for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and outline the revisions we plan to make to improve the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the 215,928 contrastive pairs 'reflect meaningful impact differences' rests on the construction process ('integrating heterogeneous data sources and targeted web crawling'), but the abstract provides no details on pair construction criteria, validation of labels, bias mitigation (e.g., field-specific coverage or English-language media bias), or statistical tests confirming the differences. This is load-bearing for all LLM evaluation and fine-tuning results.
Authors: We agree that the abstract, constrained by length, does not provide these details. The full paper (Section 3) describes the pair construction criteria in depth, including how pairs are formed based on significant differences in impact metrics from integrated sources, label validation through a combination of automated checks and human annotation on a sample, bias mitigation via stratified sampling across 19 fields and inclusion of non-English sources where possible, and statistical tests (e.g., t-tests on metric differences) to confirm meaningful distinctions. To make this more accessible, we will revise the abstract to concisely mention the validation process, bias mitigation steps, and that statistical tests support the impact differences. This revision will be made in the next version. revision: yes
-
Referee: [Evaluation section] Evaluation section: The reported performance gains from multi-task SFT (smaller 4B models outperforming 30B and o4-mini) lack accompanying statistical significance tests, confidence intervals, or ablation on label noise, which is required to substantiate the cross-model comparisons given the heterogeneous label sources.
Authors: We acknowledge that the evaluation section would benefit from greater statistical rigor. Our reported results demonstrate consistent trends, but we did not include formal tests in the initial submission. In the revised manuscript, we will add statistical significance tests (such as McNemar's test or bootstrap resampling for accuracy differences) with p-values, and report 95% confidence intervals for all model performances. Furthermore, we will conduct and report an ablation study on label noise by introducing controlled noise levels to the training labels and observing the impact on fine-tuned model performance, as well as comparing results on subsets with higher-confidence labels. These additions will strengthen the substantiation of our claims regarding the effectiveness of multi-task SFT. revision: yes
Circularity Check
No circularity in benchmark construction or empirical evaluation
full rationale
The paper constructs SciImpact by integrating heterogeneous external data sources and targeted web crawling to produce 215,928 contrastive paper pairs across 19 fields and multiple impact dimensions. This construction step is independent of the subsequent LLM evaluations and multi-task SFT experiments. No equations, parameters, or derivations are present that reduce by construction to fitted inputs or self-referential definitions. The reported performance gains (smaller models outperforming larger ones after SFT) are empirical outcomes measured on the externally sourced benchmark rather than tautological results. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Contrastive paper pairs constructed from heterogeneous sources and web crawling accurately capture meaningful impact differences
Reference graph
Works this paper leans on
-
[1]
InInternational Symposium on String Processing and Information Retrieval, pages 107– 117
Estimating number of citations using author reputation. InInternational Symposium on String Processing and Information Retrieval, pages 107– 117. Tanmoy Chakraborty, Suhansanu Kumar, Pawan Goyal, Niloy Ganguly, and Animesh Mukherjee. 2014. To- wards a stratified learning approach to predict future citation counts. InJCDL’14, pages 351–360. Yuxiao Dong, Re...
work page 2014
-
[2]
SPECTER: Document-level representation learning using citation-informed transformers
Will this paper increase your h-index? sci- entific impact prediction. InWSDM’15, pages 149– 158. Yuxiao Dong, Hao Ma, Zhihong Shen, and Kuansan Wang. 2017. A century of science: Globalization of scientific collaborations, citations, and innovations. InKDD’17, pages 1437–1446. Lawrence D Fu and Constantin Aliferis. 2008. Mod- els for predicting and explai...
-
[3]
Ching Jin, Yifang Ma, and Brian Uzzi
Predicting citation count of bioinformatics pa- pers within four years of publication.Bioinformatics, 25(24):3303–3309. Ching Jin, Yifang Ma, and Brian Uzzi. 2021. Scien- tific prizes and the extraordinary growth of scientific topics.Nature Communications, 12(1):5619. Bernard Koch, Emily Denton, Alex Hanna, and Ja- cob Gates Foster. 2021. Reduced, reused ...
work page 2021
-
[4]
From newborn to impact: Bias-aware citation prediction.arXiv preprint arXiv:2510.19246. MDPI. 2025. Mdpi awards. Mistral AI Team. 2025. Ministral 3: Strong edge-ready ai. Pandelis Mitsis. 2022. The nobel prize time gap. Humanities and Social Sciences Communications, 9(1):407. OpenAI. 2025. Openai o3 and o4-mini system card. Papers with Code. 2019. Links b...
-
[5]
Starin: An approach to predict the popularity of github repository. InInternational Conference of Pioneering Computer Scientists, Engineers and Educators, pages 258–273. Kiyan Rezaee, Morteza Ziabakhsh, Niloofar Nikfarjam, Mohammad M Ghassemi, Yazdan Rezaee Jouryabi, Sadegh Eskandari, and Reza Lashgari. 2025. Fos: A large-scale temporal graph benchmark fo...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.